Spiders are also known as robots, ants, crawlers, bots, worms, and automated indexers. A spider is a program that browses the World Wide Web in a methodical, automated manner and captures data from it.
A few definitions:
Seeds: A spider starts with a list of URLs to visit. This list is called the “Seeds”.
Crawl Frontier: After visiting a seed URL, the spider extracts all links from the visited page and adds them to a list of URLs to visit in the future. This list is called the “Crawl Frontier”.
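To make the two definitions concrete, here is a minimal sketch of a crawl frontier in Python. The class name CrawlFrontier and its methods are made up for illustration; a real spider would probably also persist the list and prioritise URLs.

```python
from collections import deque


class CrawlFrontier:
    """FIFO list of URLs to visit that never queues the same URL twice."""

    def __init__(self, seeds):
        self._queue = deque()   # URLs waiting to be visited
        self._seen = set()      # every URL ever queued (seeds included)
        for url in seeds:
            self.add(url)

    def add(self, url):
        """Queue a URL unless it was already queued or visited."""
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def next_url(self):
        """Return the next URL to visit, or None when nothing is left."""
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)
```

The spider would call next_url() to pick a page, and add() for every link it finds on that page.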
Process:
Steps for creating a spider that gathers data from web pages:
- Fetch HTML from live webpage
- Store captured data in local file-system
- Parse stored HTML
- Gather required Data from HTML
Step 1: Fetch HTML from live webpage
Once it has its seeds, the spider starts fetching the raw HTML from the available (uncrawled) URLs.
The fetch itself is not very complex, but making it perform well takes considerable learning and adds complexity.
The points to consider in this step are:
1. Access restrictions for the page being crawled (robots.txt)
2. Bandwidth usage of the spider host
3. Bandwidth usage of the site being crawled
4. The overall load on the website being crawled
5. Whether the page requires an authenticated login and/or cookies in the visitor’s browser. If so, the spider needs functionality to supply them.
6. Checking the content type before downloading the whole stream. The response may contain content other than the HTML we are interested in (such as media streams).
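Points 1 and 6 can be sketched with Python's standard library. This is only an illustration: the user-agent name ExampleSpider and the helper names are assumptions, and the robots.txt text is passed in as a string so the sketch needs no network access.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleSpider"  # hypothetical name for our spider


def build_robots(robots_txt_text):
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt_text.splitlines())
    return parser


def may_fetch(parser, url):
    """True when robots.txt allows our user agent to crawl the URL."""
    return parser.can_fetch(USER_AGENT, url)


def is_html(content_type):
    """True when a Content-Type header value denotes an HTML document."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")
```

In a real spider the Content-Type value would come from the response headers (or from a HEAD request), so the body of a media stream is never downloaded.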
Good article: How To - Write a web crawler in C#
http://www.thecodinghumanist.com/Content/HowToWriteAWebCrawlerInCSharp.aspx
Incomplete article from the geek’s corner
http://thegeekscorner.googlepages.com/csharp_multithreaded_web_spider
Focus on Technology:
For better performance there are a few options, such as multi-threading and asynchronous calls. Whether to use one or both together depends entirely on the requirements of the application. When working with multi-threading, the ThreadPool deserves particular attention.
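As a rough illustration of the thread-pool idea, the sketch below uses Python's concurrent.futures rather than the .NET ThreadPool mentioned above. The function fetch_all and its fetch parameter are assumptions; fetch stands for whatever function downloads one page.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=4):
    """Fetch every URL with a bounded pool of worker threads.

    Returns (pages, errors): pages maps url -> HTML,
    errors maps url -> the exception raised while fetching it.
    """
    pages, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which future belongs to which URL.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                pages[url] = future.result()
            except Exception as exc:  # one bad page must not stop the crawl
                errors[url] = exc
    return pages, errors
```

Keeping max_workers small also limits the load on the site being crawled, which ties back to the bandwidth points in step 1.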
Step 2: Store captured data in local file-system
After getting the HTML from a web page (seed), we need to store it somewhere in the local file-system so that other actions can be performed on it later.
The actions performed on this stored HTML are:
- Extract links from the HTML and build the crawl frontier for the next seed list
- Extract the required data and gather information. Tools such as a search engine or a data repository can be built on top of this data.
The points to consider in this step are:
1. How and where to store the captured data: in a database or in the file-system.
2. Both the database and the file-system have their own pros and cons.
3. The storage should support Unicode so that multilingual data is not corrupted.
4. The files or data must be deleted once they have served their purpose; otherwise they quickly pile up into terabytes of junk.
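The storage points above can be sketched with SQLite from Python's standard library. The table layout and function names are assumptions made for this example; any database with Unicode text columns would do.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the page store; ':memory:' keeps it in RAM."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, "
        "html TEXT NOT NULL, "
        "processed INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def save_page(conn, url, html):
    """Store the captured HTML; TEXT columns hold Unicode as-is."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, html, processed) VALUES (?, ?, 0)",
            (url, html),
        )


def load_page(conn, url):
    """Return the stored HTML for a URL, or None if it is not stored."""
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None


def mark_processed(conn, url):
    """Flag a page once links and data have been extracted from it."""
    with conn:
        conn.execute("UPDATE pages SET processed = 1 WHERE url = ?", (url,))


def purge_processed(conn):
    """Delete pages that are no longer needed; returns how many were removed."""
    with conn:
        cursor = conn.execute("DELETE FROM pages WHERE processed = 1")
    return cursor.rowcount
```

Calling purge_processed regularly is one way to keep the store from growing without bound.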
This step also contains a sub-step, building the crawl-frontier list: the links found in the page are the most likely candidates for new seeds (the crawl frontier).
This sub-step has its own points to consider:
1. Whether the link being added points to an external website or to the same site.
2. Whether robots.txt allows the link being added to be crawled.
3. Whether the link being added is already in the seed list or the frontier list.
4. File-storage space availability and security.
5. When a database is used as storage, connection pooling and some tweaks to speed up database operations.
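Points 1 and 3 of this sub-step could look roughly like the sketch below. It uses Python's built-in html.parser instead of HTML Agility Pack, and the names LinkCollector and extract_links are invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url, same_site_only=True):
    """Return absolute, de-duplicated http(s) links found in the page."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    links, seen = [], set()
    for href in collector.hrefs:
        # Resolve relative links and drop the "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if same_site_only and parts.netloc != base_host:
            continue  # external website
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

Whether external links are kept (same_site_only=False) depends on how far the spider is meant to wander.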
Focus on Technology:
A database seems to be a better option than the file-system, as it supports Unicode natively.
For extracting links I’ve used a library from CodePlex called HTML Agility Pack (http://www.codeplex.com/htmlagilitypack); it’s nice and useful.
Step 3: Parse stored HTML
Once the required data is available locally, we can start parsing it and capturing the required information from each entry. Parsing is complex, and it takes some learning to find the best-suited parsing method. A parser is never the same for all web pages; it is specific to a particular web page (or to a module, if the site uses the same presentation across the module we need), so this step is a little tricky.
HTML Agility Pack is somewhat helpful in this step: http://www.codeplex.com/htmlagilitypack
After parsing the HTML and extracting the required data, the gathered data should be stored somewhere; this is our step 4.
Focus on Technology:
Rather than parsing the raw HTML directly, it is better to convert the HTML to XML first, which makes parsing easier.
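Once a page is well-formed XML, a page-specific parser can be very short. The sketch below assumes the conversion has already been done by some other tool, that the markup has no XML namespace, and that the wanted values sit in span elements with the made-up ids first-name and last-name.

```python
import xml.etree.ElementTree as ET


def extract_name(xhtml):
    """Return (first_name, last_name) from a converted page, or None."""
    root = ET.fromstring(xhtml)
    # The element ids are specific to this (hypothetical) page layout.
    first = root.find(".//span[@id='first-name']")
    last = root.find(".//span[@id='last-name']")
    if first is None or last is None:
        return None  # the layout differs from what this parser expects
    return (first.text or "").strip(), (last.text or "").strip()
```

A different site, or a different module of the same site, would need its own lookup paths, which is why the parser can never be generic.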
Step 4: Gather required data
In this step we actually build our repository by storing the data extracted in step 3. The example below shows a live scenario of how these steps make our spider work.
Example:
Our goal: extract a user’s first name and last name from an orkut profile. We will go step by step through the steps defined above.
1. Go to his/her profile URL and get the output HTML
2. Store this output in our local storage (e.g. database or file)
a. Fetch all links available in this page and build the crawl-frontier list (depends entirely on the requirement; this is not a mandatory step)
3. Parse this HTML and locate the HTML elements that contain our required data, here the user’s first and last name (in this case, once we can get the information for one user, it will work for all profile pages)
4. After getting the required data, store it in the database.
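The four steps of the example can be tied together in one loop. The sketch below is only an outline under stated assumptions: run_spider is a made-up name, the fetch, extract_links and extract_data functions are supplied by the caller, and plain Python containers stand in for the local storage and the database.

```python
from collections import deque


def run_spider(seeds, fetch, extract_links, extract_data, max_pages=100):
    """Crawl from the seeds and return {url: extracted data}."""
    frontier = deque()
    seen = set()
    for url in seeds:
        if url not in seen:
            seen.add(url)
            frontier.append(url)

    repository = {}  # stands in for the database of step 4
    visited = 0
    while frontier and visited < max_pages:
        url = frontier.popleft()
        visited += 1
        stored_html = fetch(url)                     # steps 1 and 2: fetch, keep locally
        for link in extract_links(stored_html, url): # step 2a: build the frontier
            if link not in seen:
                seen.add(link)
                frontier.append(link)
        data = extract_data(stored_html)             # step 3: parse
        if data is not None:
            repository[url] = data                   # step 4: store the result
    return repository
```

The max_pages limit is there so that a crawl over pages that link to each other always ends.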
So this is how the spider framework works.