Storage, Filesystems and Information retrieval
Recently, I have been reading quite a bit about filesystems and their current and future development. There seem to be at least three major areas of interest: performance, user interface and distribution. We'll look at various projects along these axes, and in particular at the search-engine-backed user interface, which might set the trend for some time.
Being no filesystem or information retrieval expert, I welcome all corrections or additions to this storage survey.
Performance and core features
Performance is a domain specific measure, as different applications will have very different data access patterns. The resources (HDD, RAM, CPU, network,...) directly relate to the trade-offs and determine the criteria for evaluation: response time/latency, throughput, resource efficiency (space on HDD, CPU utilization),... All these aspects make it difficult to say that a filesystem is "better" in absolute terms, especially in the changing landscape of evolving technologies (larger and faster components). It is even harder if you add to that different data models (what metadata is stored) and feature sets (whether transactions, indexing or replication is supported).
Here is an interview with Jim Gray on storage.
He compares the evolution of storage capacity with that of access times, and expects architectural changes to handle huge-capacity disks, because sequential access (as on tapes) is much faster than random access and can be improved further by using larger chunks.
He also discusses TeraScale SneakerNet: basically PCs full of disks that he mails around. Apparently this is cheaper bandwidth than network transfers, though he doesn't detail whether this might change in the future.
Personally, if I had a terabyte drive, I'd be really concerned about it crashing. On the other hand, since I'd probably use a good part of that storage for music and films, I could back up just the metadata (title, size, signature,...) by replicating it onto distributed machines. In case of a crash, I'd use this metadata to recover copies from somebody else. Of course, any original or important files would have to be mirrored entirely, instead of just shadowed. This would fit nicely in the private information network that I tried to explore recently.
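As a rough sketch of that idea, one could periodically build a small manifest of each media file (path, size, content hash) and replicate only the manifest to other machines; after a crash, the hashes identify exactly which copies to fetch from peers. A minimal, hypothetical version in Python (the function names are mine, not from any of the projects above):

```python
import hashlib
import os

def file_signature(path, chunk_size=1 << 20):
    """Content hash of a file, read in chunks so large media files fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Walk a directory tree and record (size, signature) for every file.

    The manifest is tiny compared to the data itself, so it is cheap to
    replicate onto distributed machines; after a disk crash the signatures
    tell you which files to ask other people for.
    """
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            manifest[rel] = {
                "size": os.path.getsize(path),
                "sha1": file_signature(path),
            }
    return manifest
```

A manifest like this serializes to a few dozen bytes per file, which is what makes "shadowing" cheap where full mirroring is not.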
The Conquest filesystem is a very interesting project (mentioned on slashdot a little while back) that brings another component into the equation: persistent RAM. Basically, the filesystem uses some non-volatile/battery-backed RAM both for storing a subset of files and for providing some caching for the hard drive.
The cost of the RAM makes it economical to store the metadata, small files, executables, and shared libraries there, while larger files that aren't used as frequently and/or don't benefit as much from fast access times are stored on the slower medium.
It seems that this approach could benefit many other filesystems, if not all, as it doesn't appear to impose constraints on the structure of the filesystem itself.
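The split Conquest describes can be caricatured in a few lines: metadata, small files and code live in persistent RAM, and everything else goes to disk. The threshold and suffix list below are invented illustrations, not Conquest's actual parameters:

```python
# Hypothetical placement policy in the spirit of Conquest: metadata, small
# files, executables and shared libraries go to persistent RAM; large media
# files go to disk. The 1 MB cutoff is an invented illustration.
SMALL_FILE_LIMIT = 1 << 20  # 1 MB

CODE_SUFFIXES = (".so", ".dll", ".exe", ".bin")

def placement(name, size, is_metadata=False):
    """Return 'ram' or 'disk' for a file, mimicking a Conquest-style split."""
    if is_metadata:
        return "ram"              # all metadata stays in fast storage
    if size <= SMALL_FILE_LIMIT:
        return "ram"              # small files benefit most from RAM latency
    if name.endswith(CODE_SUFFIXES):
        return "ram"              # executables and libraries are latency-sensitive
    return "disk"                 # large data, mostly read sequentially
```

The point of the sketch is that the policy is orthogonal to the on-disk layout, which is why it could in principle be layered under other filesystems.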
Kuro5hin has an article that presents some of the features of Reiser4. Reiser manages to push the envelope without using new hardware: better performance and improved efficiency for small files, while including features like journaling and atomicity and exploring a new data model (entries are both files and directories), at the expense of higher CPU usage.
An important question is whether the non-transparent features (like the new data model for files and directories) will be adopted. Also, how well do they integrate in cross-platform scenarios: if you send a file that has metadata attached under ReiserFS, is the metadata somehow packaged, or is it lost when read by a non-ReiserFS system?
User interface: information retrieval
With the mass of information that flows by us each day (mail, blogs/feeds, web, ...), it would make sense for computers to do a better job at helping us find relevant information. There seems to be an important push in this area these days.
For example, X1 is a file and email indexer and search engine. ZOE (via O'ReillyNet) is an open source program that follows a similar approach for emails only.
Microsoft backs this shift to a search-oriented interface in Longhorn with WinFS, which will handle mails, contacts and calendar events as first-class objects and index them (apparently using the Yukon engine on top of NTFS).
Gnome also has two very promising projects: Storage (mentioned here on slashdot), a SQL-backed store with a VFS adapter, and Medusa, a user-level indexing application.
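At their core, all of these indexers share the same data structure: an inverted index mapping each word to the set of documents containing it. A toy user-level version, assuming plain-text files (real indexers add parsers for mail, HTML, etc. on top of this):

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")

def build_index(files):
    """Inverted index: map each lowercase word to the set of files containing it."""
    index = defaultdict(set)
    for path in files:
        with open(path, "r", errors="ignore") as f:
            for word in WORD.findall(f.read().lower()):
                index[word].add(path)
    return index

def search(index, query):
    """Return files containing every word of the query (AND semantics)."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

Because lookups hit the index rather than the files, queries stay fast no matter how much data has been indexed, which is what makes search viable as a primary interface.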
The Seruku toolbar for Internet Explorer records all the web pages you visit (with a configurable exception list) and lets you query them by keywords.
The MyLifeBits MS Research project (described on Wired here) combines Seruku-style web recording with other data sources (files, emails,...). But the program doesn't seem to be publicly available.
Dashboard brings two interesting variations to the projects listed so far: its architecture is extensible to support multiple data sources (like the Evolution mail client and the Epiphany web browser), and it displays the information interactively (based on context clues sent by the applications you use).
I'm not sure the continuous search will suit me, but the common infrastructure for the indexer is certainly very exciting. I also look forward to having more information used, like how I got to a certain page or who sent me a link.
Distribution
There are many different approaches to making your files available in a distributed system.
The CODA filesystem is an interesting one, with support for disconnected operation. Files are cached on the replica (which may be a laptop, for example) for fast access and to allow editing while the computer is online or disconnected from the master. The cache is configurable and the user can force some files to be cached/replicated. Conflicts can be resolved by application-specific resolvers or manually.
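Disconnected operation boils down to caching plus conflict detection on reconnect. Here is a toy sketch using per-file version counters; real CODA uses version vectors, callbacks and the resolvers mentioned above, so this only shows the shape of the idea:

```python
class Replica:
    """Toy disconnected-operation cache: detect conflicts on reintegration.

    A sketch only: the server is modeled as a dict of name -> (version,
    content), and a single counter stands in for CODA's version vectors.
    """

    def __init__(self, server):
        self.server = server   # shared dict: name -> (version, content)
        self.cache = {}        # local copies, possibly edited offline
        self.dirty = set()     # files edited while disconnected

    def fetch(self, name):
        self.cache[name] = self.server[name]

    def edit_offline(self, name, content):
        version, _ = self.cache[name]
        self.cache[name] = (version, content)
        self.dirty.add(name)

    def reintegrate(self):
        """Push offline edits; report conflicts where the server moved on."""
        conflicts = []
        for name in sorted(self.dirty):
            local_version, content = self.cache[name]
            server_version, _ = self.server[name]
            if server_version == local_version:
                self.server[name] = (local_version + 1, content)  # clean update
            else:
                conflicts.append(name)  # both sides changed: needs a resolver
        self.dirty = set(conflicts)     # conflicted files stay pending
        return conflicts
```

If the server's version still matches the one we cached, our edit wins cleanly; otherwise both sides changed and something (an application resolver or the user) has to arbitrate.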
Harmony, on the other hand, doesn't have a master and tries to keep all replicas synchronized.
Another approach to making your files ubiquitous is to carry something like Intel's personal server around with you.
What I like about this device is that it can act as a high-bandwidth channel between machines that can't connect to each other (maybe one is off when the other is on).
But in the pure server model, I don't think it brings much. Because the server moves along with you, file access might be a bit faster (local) than with a regular server (on the internet), but not that much faster because of the wireless link. Also, the hard drive won't be large enough to store all your files, you'll need a separate backup solution, and a battery is always trouble.
Apparently Longhorn's WinFS will also have facilities for replicating/distributing your files over multiple machines, but not many details are publicly available yet.
The Chandler project also wants to distribute your information, but it isn't clear whether it will be at the filesystem level or managed by the application.
The "New Advances in the Filesystem Space" paper (by Grant Miner and mentioned on slashdot here) makes a great case for namespace unification and for uniform handling of attributes and metadata.
Lambda the Ultimate on the USENIX 2003 talks: a logical file system.
TrailBlazer is a cool MacOSX browser-history tool that shows how you navigated from page to page.
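Recording "how I got to a certain page" only takes an edge list keyed on the referring page; the trail is then a path in that graph. A minimal sketch of the idea (the class and method names are mine, not TrailBlazer's):

```python
from collections import defaultdict

class Trail:
    """Record page-to-page navigation as a directed graph of referrer edges."""

    def __init__(self):
        self.edges = defaultdict(list)   # from_url -> [to_url, ...]
        self.came_from = {}              # to_url -> from_url (last visit wins)

    def visit(self, url, referrer=None):
        if referrer is not None:
            self.edges[referrer].append(url)
            self.came_from[url] = referrer

    def path_to(self, url):
        """Walk referrers backwards to reconstruct how a page was reached."""
        path = [url]
        seen = {url}
        while path[-1] in self.came_from:
            prev = self.came_from[path[-1]]
            if prev in seen:
                break                    # guard against referrer cycles
            path.append(prev)
            seen.add(prev)
        return list(reversed(path))
```

The same structure would also answer "who sent me this link" if visits coming from mail were recorded with the message as the referrer.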
The "Underground Economist" wonders about personal search engines, revolving around bookmarks, browser history and blogs.
FURL is another stab at this, based on special bookmarking and search through cached pages.
Speaking to your comments on information retrieval, my company is about to go beta on an Internet companion application that runs on top of and integrates with a user’s web browser, email software and email address book, enabling users to make saving, organizing and sharing (“SOS”) a fulcrum of Internet productivity.
The idea is that whereas most search-related activities end up being relegated to the 'get it and forget it' bucket, with SOS baked into the online realm, three productivity leaps can be realized in terms of managing information of interest that's found on the Internet:
1. "Saving": No matter where you are and what you are doing on the Internet—be it reading email, performing a search or browsing the web—when a given web page or email inspires you, you can call up and capture both the online content and any products or businesses referenced in it quickly and non-disruptively. We call this discovery mode, and have an underlying construct that we call actionable listings to enable you to actually “do something” with the information when you find it.
2. "Organize": Through support for automatic and manual associations, modalities and real-time filtering capabilities, you can increase the context, relevance and reusability of product or business listings, supporting web content and related emails. This is intended to address the paradox of whether to bookmark, email to self or save to local folder, and then what to do when a given piece of information is applicable to multiple contexts.
3. "Share": Share the personalized lists you create of products, businesses and supporting content easily with a multi-selection list view. Choose the recipient or recipients using your email address book, and with a click, convert it to an email ready web page and you're done.
The basic product is a free download, although there will be a paid version of the software targeted at corporations and participants in associates programs. If you want to find out what exactly we are doing and why you would want to use the software, there are whitepapers and overview documents on the site (www.verdada.com). Or, wait until the start of October when the public beta begins.

Posted by: hypermark (September 23, 2003 11:40 AM)
The roadmap slides for Mozilla 2.0 (http://mozilla.org/events/dev-day-feb-2004/mozilla-futures/ ) list some end-user innovations related to this:
- Personal intertwingled data server
- Shared web page annotations