The search engine startup Cuil (pronounced "Cool"), which we first told you about in July, isn't very "cool" in the way its indexing robot treats websites. TechCrunch reports that Cuil's Twiceler web crawler is bringing many websites to their knees.
What is Twiceler doing? Last year, posters in a thread about Twiceler on The Admin Zone forum pointed out that the crawler was opening many connections in a short span of time, amounting to a de facto denial-of-service "attack" on the sites it crawled. While Twiceler no longer works quite that way, it's still behaving badly.
For example, the JazzyChad blog recently reported that Twiceler was indexing invalid addresses that turned into 404 (file not found) errors when Cuil users tried to follow them. Joe Kirp's Popular Science and Technology blog reports:
The Twiceler bot is probably the most stupid crawler I've ever seen; it just downloads everything it can find, and it seems that it won't ever stop. If a page uses dynamic input in a URL (a calendar, for example), it will download the same page 100,000 or more times, simply by following every kind of dynamic link it can find, without any kind of intelligent limitation.
By downloading thousands of pages per hour from each website, it can cause incredible traffic on a server, and dynamic scripts (written in Perl, Python, or PHP, for example) start causing an immense CPU load that may even take your entire server down (as reported by several webmasters). Twiceler is really harmful and can cost both money and downtime. A well-written crawler such as Googlebot or Slurp (Yahoo) would never affect a website in such a malicious way.
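The failure mode described in that quote, endlessly re-fetching the same page through dynamic links, is exactly what URL canonicalization and frontier deduplication are meant to prevent in a well-behaved crawler. Here is a minimal sketch of that safeguard; the function and class names, and the choice of which query parameters to keep, are our own illustrative assumptions, not code from any crawler mentioned in this article:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

def canonicalize(url, tracked_params=("page", "id")):
    """Reduce a URL to a canonical form: lowercase the host, drop the
    fragment, and keep only a small whitelist of query parameters,
    sorted. Calendar-style parameters (month=, day=, ...) are dropped,
    so the thousands of dynamic links they generate collapse into one
    canonical URL instead of 100,000 distinct fetches."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query)
                  if k in tracked_params)
    base = f"{parts.scheme}://{parts.netloc.lower()}{parts.path}"
    return base + ("?" + urlencode(kept) if kept else "")

class Frontier:
    """Crawl frontier that refuses to enqueue a canonical URL twice."""
    def __init__(self):
        self.seen = set()
        self.queue = []

    def add(self, url):
        canon = canonicalize(url)
        if canon not in self.seen:
            self.seen.add(canon)
            self.queue.append(canon)
```

With this in place, every month/day permutation of a calendar page maps to the same canonical URL and is fetched only once, which is the "intelligent limitation" the quote finds missing from Twiceler.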
How can you stop Twiceler from bringing your website to a crashing halt? Cuil claims on its webmasters' information page that Twiceler obeys the instructions in a web server's robots.txt file (a commonly used way of telling web search robots to index or ignore specified parts of a site, or all of it). But many frustrated webmasters, such as Alex Higgins, have discovered that Twiceler blows right past normal 'do not index' instructions. As Higgins puts it:
Cuil's Twiceler bot would not obey my robots.txt file. Attempts to make it go away by sending it blank responses with 404 (Page Not Found), 500 (Internal Server Error), and even 403 (Access Denied) status codes were ignored.... I then banned the Cuil spider's IP address. Then it started using different IPs and cloaked its identity by not sending its usual User-Agent header (Mozilla/5.0+(Twiceler-0.9+http://www.cuill.com/twiceler/robot.html).... Basically, it would find a link to a given URL, for example http://blog.alexanderhiggins.com/topics/blogging, and would begin hacking the URL into different parts looking for hidden directories. Using the above URL as an example, it would begin crawling the /topics directory like this:
Eventually, it would repeat the process, chopping up /topics itself. This bot was so bad that I had to programmatically listen for malformed requests and ban its IP address on the fly to prevent it from crashing my server again.
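The on-the-fly ban Higgins describes can be sketched in a few lines: track requests that miss (404) per client IP and blacklist any IP that racks up too many misses in a short window. This is a minimal illustration of the idea, not Higgins's actual code; the class name, thresholds, and the 404-counting heuristic are all our own assumptions:

```python
import time
from collections import defaultdict

class AutoBan:
    """Ban client IPs that generate too many 404s in a short window --
    the on-the-fly countermeasure Higgins describes. The thresholds
    here (20 misses in 60 seconds) are illustrative, not his values."""
    def __init__(self, max_misses=20, window=60.0):
        self.max_misses = max_misses
        self.window = window
        self.misses = defaultdict(list)  # ip -> timestamps of 404s
        self.banned = set()

    def is_banned(self, ip):
        return ip in self.banned

    def record_miss(self, ip, now=None):
        """Call this whenever a request from `ip` produces a 404."""
        now = time.time() if now is None else now
        # Keep only misses inside the sliding window, then add this one.
        recent = [t for t in self.misses[ip] if now - t < self.window]
        recent.append(now)
        self.misses[ip] = recent
        if len(recent) >= self.max_misses:
            self.banned.add(ip)
```

A request handler would check `is_banned(ip)` before serving, call `record_miss(ip)` on each 404, and drop banned clients with a 403 or a firewall rule, so a crawler that "hacks the URL into different parts" bans itself after a handful of misses.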
If Cuil can't make Twiceler behave, can you? Yes, you can. JazzyChad provides a simple addition to the robots.txt file that will send Twiceler on its way.
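JazzyChad's fix amounts to a block along these lines, using the Twiceler token from the crawler's published User-Agent string; note that this only helps if the crawler actually honors robots.txt, which the reports above dispute:

```
User-agent: Twiceler
Disallow: /
```

Dropping those two lines into the robots.txt file at your site's root tells Twiceler (and only Twiceler) to stay out of the entire site.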
If you're not a webmaster, what should you make of all this? Cuil's website indexing technology doesn't seem ready for prime time, and given its flawed methodology, how many of the over 121 billion web pages Cuil claims to search actually exist? You'll have to decide that one for yourself.