In search of BitTorrent files
I have been using BitTorrent quite a bit, but I still find it tedious to find content. There are sites that try to aggregate torrent offerings, but finding something you want still takes too much time. There are even a couple of BT search sites out there (Torrent Search Engines / Site Crawlers), but they don't seem to work properly, or at all.
I am currently implementing a search engine that indexes .torrent files and potentially other P2P entry points (like ed2k: URLs for eDonkey). A basic harvesting part is functional, so the next steps are indexing and searching.
The harvesting isn't meant to crawl the whole web; instead, it just monitors a set of sites that publish torrent files. This means that some maintenance is still needed to add new sites to the system, so I also plan to add a ping system (like blog.gs) that lets sites announce themselves.
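To give an idea of what indexing a .torrent file involves, here is a minimal Python sketch (not the actual engine, which targets .NET). It decodes the bencoded metainfo and pulls out the fields a search index would most likely want. The names `bdecode` and `torrent_summary` are made up for this illustration, and the sketch assumes a well-formed single-file or multi-file torrent.

```python
def _decode(data: bytes, pos: int):
    """Decode one bencoded value starting at pos; return (value, next_pos)."""
    lead = data[pos:pos + 1]
    if lead == b"i":                      # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":                      # list: l<items>e
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = _decode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":                      # dictionary: d<key><value>...e
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = _decode(data, pos)
            value, pos = _decode(data, pos)
            result[key] = value
        return result, pos + 1
    if lead.isdigit():                    # byte string: <length>:<bytes>
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        return data[start:start + length], start + length
    raise ValueError(f"invalid bencoded data at offset {pos}")


def bdecode(data: bytes):
    """Decode a complete bencoded document."""
    value, pos = _decode(data, 0)
    if pos != len(data):
        raise ValueError("trailing data after bencoded value")
    return value


def torrent_summary(raw: bytes) -> dict:
    """Extract the fields worth indexing from raw .torrent bytes."""
    meta = bdecode(raw)
    info = meta[b"info"]
    name = info[b"name"].decode("utf-8", "replace")
    if b"files" in info:                  # multi-file torrent
        files = ["/".join(part.decode("utf-8", "replace") for part in f[b"path"])
                 for f in info[b"files"]]
        size = sum(f[b"length"] for f in info[b"files"])
    else:                                 # single-file torrent
        files = [name]
        size = info[b"length"]
    tracker = meta.get(b"announce", b"").decode("utf-8", "replace")
    return {"name": name, "files": files, "size": size, "tracker": tracker}
```

The name and file paths are the obvious text to feed into the index, while the size and tracker URL are useful to show next to each search result.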
I'll post more information when I get the complete prototype working.
Some thoughts on the harvesting so far:
- The HTML you find on the web is scary. The crawler found some seriously malformed HTML that renders well in browsers but is a real pain to parse. The worst I have seen so far is unclosed anchor tags within tables.
- Resources need to be balanced: you don't want the crawler to create unnecessary load on the visited server.
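Both points can be sketched in a few lines of Python (again, not the .NET crawler itself). The first half uses the standard library's event-based `html.parser`, which reports each start tag as it sees it and so does not care whether an anchor is ever closed. The second half is a simple per-host delay. `TorrentLinkParser`, `find_torrent_links`, `HostThrottle` and the 5-second default are hypothetical names and values chosen for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class TorrentLinkParser(HTMLParser):
    """Collect links to .torrent files, ignoring how well-formed the page is."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only the opening <a ...> matters, so unclosed anchors are harmless.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)   # resolve relative links
        if urlparse(url).path.lower().endswith(".torrent") and url not in self.links:
            self.links.append(url)


def find_torrent_links(html: str, base_url: str) -> list:
    parser = TorrentLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


class HostThrottle:
    """Space out requests so each host sees at most one per min_delay seconds."""

    def __init__(self, min_delay: float = 5.0):
        self.min_delay = min_delay
        self._next_allowed = {}              # host -> earliest allowed fetch time

    def reserve(self, url: str, now: float) -> float:
        """Book the next slot for url's host; return how long to wait first."""
        host = urlparse(url).netloc
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + self.min_delay
        return start - now
```

A crawler loop would call `reserve(url, time.monotonic())`, sleep for the returned number of seconds, and only then fetch the page.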
An interesting set of slides on the various aspects of building a search engine: Algorithms for Web Indexing and Searching.
Although I need some data persistence, I don't really want to use a full DB solution. I have been looking at Berkeley DB and SMYLE type solutions, but couldn't find their equivalent in .NET so far. I am still looking for a good persistence solution that would allow flexible and fast lookups.
I found a possible storage solution: PERST. I'll post more info on this soon.
Really annoying: the crawler sometimes gets stuck while processing a page. So far my debugging shows that it waits indefinitely in the Close() call on the HttpWebResponse after the fetched body has been processed. It happens on random sites that otherwise load correctly in a browser.
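This does not explain the .NET hang, but the general defense is to make sure no single fetch can block the crawler forever. A minimal Python sketch of that idea follows; the function name `fetch` and the default timeout and size limit are assumptions for illustration.

```python
import http.client
import urllib.request


def fetch(url: str, timeout: float = 10.0, max_bytes: int = 1_000_000):
    """Return up to max_bytes of the response body, or None if the fetch fails.

    The timeout applies to each blocking socket operation (connect, read),
    so a server that stops responding cannot stall the crawler indefinitely.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read(max_bytes)
    except (OSError, http.client.HTTPException):
        # Covers timeouts, connection errors, HTTP error statuses
        # and malformed responses.
        return None
```

Note that the timeout is per socket operation rather than a cap on the whole download, so a server that trickles data slowly can still take longer than `timeout` in total.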
Google lets you filter results by file type, and it now supports searching torrent files as well. The freshness won't be as good as suprnova, torrentz or IRC though.
Update: Suprnova and other torrent sites went down. Where will torrents be next? Will some new P2P tools appear to fight back or does the MPAA have bigger problems anyways (memory sticks)?
______________________________________
A nice idea: using the HTML parser from IE to convert HTML into a DOM ( http://www.jelovic.com/weblog/e117.htm ), rather than rewriting HTML parsing logic.
I might try this in the crawler.
The best site I found was www.suprnova.org, but it's currently down. It will most likely be back up in a couple of days. Happy downloading!
Posted by: Andre (March 6, 2004 12:55 AM)
______________________________________
Just found some scraped RSS feeds for suprnova and torrentz: http://varchars.com/archives/2004/02/32.html Cool!
I know this doesn't really relate to your posting, but I recently formatted my computer and forgot to save a few programs that I used to play the (.ogg) movie files with. I have tried to locate a media player of some type to play ogg, but have found that ogg files are for audio, not video... I have looked and looked, and still can't find anything that will play the anime shows I downloaded =/. I don't know how the BitTorrent website got video into these files, but they did, and I at one time had something that could play them..... Someone plz help =(
Posted by: Kenneth (November 6, 2004 02:10 PM)
______________________________________
Download the "core media player" from http://www.corecoded.com/ ... I'm sure it will work for you cuz it works for me :)...
Posted by: DirtY~DaWg (January 1, 2005 03:51 PM)