The web is an ocean of information containing more than 10 billion web pages, 90% of which are in unstructured or semi-structured formats. At present, it is growing by about 1 million pages per day. Information is increasing at an explosive rate, while people's time and energy are limited. The information that is truly valuable to enterprises and individuals lies somewhere in this worldwide ocean, and how to extract it has become one of the most pressing tasks for research institutions working in Information Retrieval, Data Mining, Knowledge Management, Competitive Intelligence, and related fields.
The Blue Whale Web Data Extraction System (BWDES) is like a huge blue whale that cruises this information ocean every day. It is capable of automatically and accurately extracting valuable information for you from the ocean of web pages, in which a multitude of useless messages (such as page headers and fo