Answers

 

Prakash K

Sr. Design engineer at LSI


Extracting data from search query using a bot (program)

I need a way to extract query results from sites that return results based on a search query. It would be great if all this data could be stored to a file for later viewing. Can this be done? Can I use Python or Perl? What if the site needs a login and password? Are there ways to provide these as inputs to the automated script?

posted 4 months ago in Web Development | Closed


Good Answers (3)

 

William B

Career Developer - Focused on Constructivist Process


This was selected as Best Answer

Mr Krishnamoorthy,

You might find that stackoverflow.com is a better place to post questions like this. However, here's the gist with Python.

Use urllib2 to GET or POST your query to the search engine, with a login and password if necessary, and to retrieve the results. If the site requires authentication and also spreads its results over multiple pages, life becomes more complicated, because the script has to stay logged in across requests.
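A minimal sketch of that step, assuming Python 3, where urllib2's functionality lives in urllib.request. The helper names (build_search_request, make_opener, fetch) and any URLs or form field names you pass in are hypothetical; every site defines its own.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_search_request(base_url, params, use_post=False):
    """Build a GET or POST request carrying the given parameters."""
    encoded = urllib.parse.urlencode(params)
    if use_post:
        # POST: the parameters travel in the request body, as bytes.
        return urllib.request.Request(base_url, data=encoded.encode("utf-8"))
    # GET: the parameters are appended to the URL as a query string.
    return urllib.request.Request(base_url + "?" + encoded)


def make_opener():
    """Return an opener that keeps cookies, so a login carries over to later requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def fetch(opener, request, timeout=30):
    """Send the request through the opener and return the response body as text."""
    with opener.open(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

For a site with a form-based login, one would typically create a single opener, POST the credentials to the login URL with it, and then reuse the same opener for every search request so the session cookie is sent along. Sites that use HTTP basic authentication or JavaScript-driven logins need a different approach.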

Use BeautifulSoup to extract information from the retrieved pages and to determine whether more pages in a sequence of search results are available (for example, a situation like 'page 1 of 5').
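To illustrate the idea without extra dependencies, here is a rough sketch that uses the standard library's html.parser in place of BeautifulSoup (BeautifulSoup makes the same job shorter and copes better with messy markup). The markup it looks for, links with class="result" and a 'page N of M' indicator, is an assumption; a real site will need its own selectors.

```python
import re
from html.parser import HTMLParser


class ResultParser(HTMLParser):
    """Collect (href, text) pairs for links marked class="result" (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None  # href of the result link currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Only links that carry an href and the "result" class are collected.
        if tag == "a" and "result" in (attrs.get("class") or "").split():
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.results.append((self._href, "".join(self._text).strip()))
            self._href = None


def page_position(html):
    """Return (current, total) from text like 'page 1 of 5', or None if absent."""
    match = re.search(r"page\s+(\d+)\s+of\s+(\d+)", html, re.IGNORECASE)
    return (int(match.group(1)), int(match.group(2))) if match else None


def has_more_pages(html):
    """True when a 'page N of M' indicator shows that N is below M."""
    position = page_position(html)
    return position is not None and position[0] < position[1]
```

Feed each retrieved page to a fresh ResultParser, read its results list, and keep requesting the next page while has_more_pages returns True.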

You can pass the query and login information to the script as required.
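One possible way to do that, sketched with the standard argparse and getpass modules. The option names and the SCRAPER_PASSWORD environment variable are made up for illustration.

```python
import argparse
import getpass
import os


def parse_args(argv=None):
    """Read the query, login name, and output file from the command line."""
    parser = argparse.ArgumentParser(description="Run a search and save the results.")
    parser.add_argument("--query", required=True, help="search terms to submit")
    parser.add_argument("--user", help="login name, if the site requires one")
    parser.add_argument("--output", default="results.html", help="file to write results to")
    return parser.parse_args(argv)


def read_password(env_var="SCRAPER_PASSWORD"):
    """Take the password from the environment, else prompt for it without echoing."""
    return os.environ.get(env_var) or getpass.getpass("Password: ")
```

Keeping the password out of the command line means it does not end up in shell history or in the process list.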

Alternatively, you could try the Python framework known as Scrapy. Although I've done quite a bit of page scraping, I haven't had to resort to it yet.

All the best.

posted 4 months ago

 

Ken W

Lead Software Developer at Coalmarch Productions


Look into wget if you're on Linux or Mac.

posted 4 months ago

 

Leonid L

Software Engineer at Linedata Services


Prakash,
Geoff is hard on the noobs, but he is actually a nice guy, as I learned.

As other respondents suggested, you need either GET or POST, and you need to understand the syntax that the website accepts.
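One way to learn that syntax for a GET-based search is to run a search by hand in the browser, copy the resulting URL, and take it apart. A small sketch in Python; describe_search_url is a hypothetical helper and the URL in the usage note is made up.

```python
from urllib.parse import parse_qs, urlsplit


def describe_search_url(url):
    """Split a search URL into its endpoint and its query parameters."""
    parts = urlsplit(url)
    endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # parse_qs maps each parameter name to a list of its decoded values.
    return endpoint, parse_qs(parts.query)
```

Calling describe_search_url("http://example.com/search?q=python+bot&page=2") gives the endpoint to send requests to and shows which parameter carries the search terms and which one selects the page. For POST-based searches the parameters are not in the URL, so you would have to read them from the form's HTML or from the browser's network inspector.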

You should be able to use Python/Perl/Java/Ruby/C++/C/.NET ... it's just a matter of finding the right library.

In the unlikely event that the site is so obfuscated that you can only get what you need by clicking (more likely, you could not figure out its protocol), you can pull out a big and expensive gun: the Ranorex library. It will do the clicking and typing for you.

posted 4 months ago

More Answers (3)

 

Ewan N

Project Leader Telecommunications at Philip Morris International IT Service Center


Prakash,
At the risk of sounding like Microsoft Office Agent, you appear to be writing a search engine.

Maybe the links below are of interest.

regards
ewan

Links:

posted 4 months ago

 

Geoff F

"Hands-on" Software Architect and Senior Developer


Are you actually thinking about what you write?

How can an outside agent know what a query string for a site will look like before it is executed? Only in a world of magic and not deterministic programs is this possible.

posted 4 months ago

 

Steve G

Principal, Site Tuneups Web Development


Goeff, I think he wants to provide the query. It should be straightforward enough to POST a search string to the engine and store the server response. It seems to me that the trick is in understanding the syntax of the search string needed, including what process to call. If the spider has to figure that out on the fly, it could get complicated.
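Storing the response could look something like this in Python. It is only a sketch: store_response is a hypothetical helper, and naming the file after the query is one arbitrary choice among many.

```python
import re
from pathlib import Path


def store_response(body, query, directory="."):
    """Write the raw response bytes to a file named after the query; return its path."""
    # Reduce the query to a filesystem-safe name, e.g. "python bot" -> "python_bot".
    slug = re.sub(r"[^A-Za-z0-9]+", "_", query).strip("_") or "query"
    path = Path(directory) / (slug + ".html")
    path.write_bytes(body)
    return path
```

Writing the bytes exactly as received avoids guessing the page's character encoding at save time; the file can be opened in a browser or parsed later.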

Clarification added 4 months ago:

I meant Geoff, sorry about that

posted 4 months ago