
Julien Couvreur's programming blog and more

Information wants to be decentralized


One of the top limitations with computers today (along with search) is the problem of synchronized information between multiple machines, described earlier in "Private Information Network".
I read about other people reporting the same need all the time. Just in the last couple of weeks, Richard mentioned synchronizing bookmarks, Jon Udell wrote about keeping devices synchronized so they become interchangeable, Wired listed "Make networked home PCs back each other up" in its "101 Ways to Save the Internet" list, and Tim discussed various solutions for keeping feeds and mail synchronized.

In the past, I have relied on a server-based approach to handle a subset of my information: bookmarks (using a custom bookmark manager + a bookmarklet), RSS feeds (using Bloglines) and email (using Outlook Web Access). But this feels quite limited and brittle, because it is web-based and centralized.

Local vs. web-based review:

Local:
+ fast/rich UI, control
- access, backup

Hosted web service:
+ access, setup, aggregation of data across multiple users (PageRank, recommendations, ...)
- slower/simpler UI (likely web-based), dependency on service provider, cost, backup, resource efficiency

Self-hosted server:
+ access, control, resource efficiency
- slower/simpler UI, setup, maintenance, backup

List of the dimensions considered:
access: having access to your information from anywhere,
UI: responsive/rich vs. slower/web-based,
control: choice and customization of the software,
backup: risk of losing the data (hard-drive crash, service provider goes bankrupt),
setup & maintenance: hassle and technical difficulty to set up and maintain,
resource efficiency & cost: exploits mainly unused resources vs. requiring new infrastructure.

The major trade-off is getting a slower, web UI in exchange for better access. But why isn't there a way to get both most of the time?
Also, even the centralized solutions carry risks of losing the data, as it is still in only one location.

Why not move toward a decentralized application model, where the data is kept on multiple machines, accessible and secure?
For example, my various machines (home, work, laptop, PDA, ...) would stay connected via a private P2P network, which would allow local caching and remote synchronization of the data. Some form of web access could also be useful, in case you want to access some information from a friend's machine.

This model is based on a few assumptions: hard drive space is cheap and much of it is wasted, the most important/valuable information isn't very big, and the network is fast and often not used at full capacity.

+ access: your information is available via replication to most of your machines, on-the-fly caching, or web access,
+ fast, rich UI: the data is cached locally for rich interaction,
+ control: you can pick and switch applications easily,
+ backup: the distributed replication ensures a natural backup,
+ resource efficiency: local caching helps limit the network usage, wasted hard drive space is used,

- replication time: if the machine with the latest version of an item went offline before replicating, you only have access to an older version,
- setup: machines must be added to the private P2P network one by one, replication configuration can be tricky,
- synchronization conflicts: because there are multiple copies of the same piece of data, any changes made offline need to be merged, and sometimes manual intervention is needed to resolve conflicts,
- resource efficiency: data replication isn't the most efficient use of storage, and synchronizing a number of machines isn't network-efficient either.
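One common way to detect the synchronization conflicts described above is to attach a version vector to each item, so two replicas can tell whether one copy supersedes the other or whether the edits were concurrent. The sketch below is a minimal Python illustration with made-up names (`VersionVector`, `compare`) — it is not from any real framework:

```python
class VersionVector:
    """Maps a replica id (e.g. "home", "laptop") to a counter of
    edits made on that replica and seen by this copy of the item."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def bump(self, replica_id):
        # Record one local edit made on the given replica.
        self.counts[replica_id] = self.counts.get(replica_id, 0) + 1

    def dominates(self, other):
        # True if this copy has seen every edit the other copy has.
        return all(self.counts.get(r, 0) >= c for r, c in other.counts.items())

    def compare(self, other):
        if self.dominates(other) and other.dominates(self):
            return "equal"
        if self.dominates(other):
            return "newer"
        if other.dominates(self):
            return "older"
        return "conflict"  # concurrent edits: a merge is needed
```

When `compare` returns "conflict", neither copy has seen the other's edits — exactly the case where an automatic merge or manual intervention is required.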

Recommendations for a framework:
P2P connectivity: some machines may be behind NATs or firewalls. Using a P2P topology helps restore connectivity between the machines of the private network.
API-based synchronization: file-level synchronization makes it difficult to handle conflicts. If applications run on top of a changeset management API, it should be easier. But conflict resolution still seems like the toughest problem in such a decentralized architecture.
Synchronous/asynchronous replication: in some cases replication can occur on demand, when I request a local copy of a file, but it can also happen in the background, either on a schedule or when the machine is idle.
Network efficient: not all data should be mirrored to all machines. Data should be transferred as directly as possible between machines.
Secure: information stored in the private network needs to be secured against unauthorized access. This could later be extended to support file sharing between private machine networks (say, friend to friend).
Web bridging: a fallback solution should be provided for access from a machine outside the network, via a web interface and maybe an applet.
Streaming support: why transfer a DivX to a local machine before viewing it? If the network supports it, the player should stream the media over the private network.
Metadata replication: do you really need to replicate all your MP3s? You still might want to back them up one way or another. Replicating the metadata (file hash, filename, MP3 title, ...) is enough to recover the content from the internet for some kinds of files.
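The metadata-replication idea can be made concrete: instead of mirroring a large file, each machine replicates a small record describing it, and the hash later lets you verify a copy recovered from elsewhere. Here is a sketch using Python's standard hashlib; `describe_file` is a hypothetical helper, not part of any real API:

```python
import hashlib
import os

def describe_file(path):
    """Build a small metadata record that can be replicated in place
    of the (possibly large) file content itself."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        # Hash in chunks so large media files don't need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            sha1.update(chunk)
    return {
        "filename": os.path.basename(path),
        "size": os.path.getsize(path),
        "sha1": sha1.hexdigest(),
    }
```

The record is a few hundred bytes regardless of whether the file is a bookmark export or a DivX, which is what makes replicating it to every machine cheap.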

Related links:
InterMezzo filesystem.
Synchronization of Information Aggregators using Markup (SIAM) and the challenges of synching.
WinFS synchronization.
The Dangers of Replication, and a Solution (by Jim Gray and others).
A list of version control systems (darcs and monotone seem really interesting, since they seem to manage decentralized merges).
Weblications: a great summary of the evolution of applications toward the web. Google has made pretty responsive and rich web UIs based on the centralized model. I still believe a decentralized model with caching will appear and disrupt these large service providers, but we're not there yet.


"two weeks" :)

I just wanted to add: the 'web = poor UI' vs. 'local = rich UI' divide may come to an end, thanks to more and more people coding web services (even if I'm quite disappointed when I try to find some free ones around... but thanks Amazon, I'm using theirs quite a lot).
Of course, it's still slower than truly local stuff, but with an adapted user flow and caching policy... good enough.

Besides that, a StreamAgent, an auto-synchronized folder, and some custom apps based on this common API should meet most people's needs.

We'll talk again in two weeks then ;-)

Posted by: KiniK (February 10, 2004 11:20 AM)

I'm not sure 2 weeks is nearly enough to implement this ;-)

Posted by: Dumky (February 10, 2004 12:14 PM)

Rhino and Newton, man. Rhino and Newton.
That's all I'm sayin'...


Posted by: sk (March 2, 2005 11:46 AM)