May 27, 2004

Comment spam

I got hit by large waves of comment spam in the last couple of weeks. I apologize to any reader who was offended. The spamming got worse after I added the "last comments" section to the front page.

At first I just activated the "Email new comments" MovableType feature and was deleting the spam manually. That quickly became painful.
I then tried the solution of having a "delete comment" link in the email, but for some reason that link wouldn't display correctly in Outlook.


MT blacklist
I'm not sure why I didn't use Jay Allen's MT-blacklist plugin earlier, even though I knew about it. It turned out to be super easy to install. It's been tremendously helpful to me so far.
It also includes a link in the email notification for new comments, and that one works fine in Outlook.

One problem is that I've had to blacklist some very short keywords like "mom", "son" and "rape" which started triggering false positives (for example, a comment with "scraped" got flagged). Having a way to exempt a comment from the filtering would be good.
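To illustrate the false positive, here is a minimal sketch comparing plain substring matching with whole-word matching. It is not MT-blacklist's actual matching code; the function names and the tiny keyword list are made up for the example.

```javascript
// Hypothetical blacklist, using the short keywords mentioned above.
const blacklist = ["mom", "son", "rape"];

// Escape regex metacharacters so a keyword is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Naive check: flags a comment if a keyword appears anywhere, even inside a word.
function matchesSubstring(comment, words) {
  const text = comment.toLowerCase();
  return words.some((w) => text.includes(w));
}

// Stricter check: the keyword must stand alone as a whole word.
function matchesWholeWord(comment, words) {
  return words.some((w) =>
    new RegExp(`\\b${escapeRegExp(w)}\\b`, "i").test(comment)
  );
}

console.log(matchesSubstring("I scraped my knee", blacklist)); // true
console.log(matchesWholeWord("I scraped my knee", blacklist)); // false
```

Whole-word matching avoids the "scraped" case, but a per-comment exemption would still be useful for legitimate comments that contain a blacklisted word on its own.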


What's next?
Sifry also wonders about the blog comment spam solutions and the coming arms race.

I think most distributed moderation systems will come down to two parts: some form of identification and some form of reputation.

The identification could use your home url, a PGP key, a TypeKey or Passport account...

The reputation could be handled centrally by Technorati, in the case that Sifry discusses (where Technorati binds the comment posts to the original post).
But you could also have a binary reputation using LOAF and blogrolls: you publish a size-efficient list of IDs (whatever they are) that you specifically trust and a list of IDs you specifically don't, and you combine these with the lists published by the other blog owners you trust (your blogroll).
The remaining problem is how to handle comments/commenters that don't fall in either category, because they are unknown to the system. These probably still need to be manually approved or at least continue to be filtered.
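A rough sketch of how the combined lists could be consulted. Everything here is an assumption for illustration: plain Sets stand in for whatever compact encoding gets published, the function and field names are invented, and I arbitrarily let my own lists win over the blogroll's and "distrust" win over "trust".

```javascript
// own:      { trusted: Set, distrusted: Set }  (my published lists)
// blogroll: array of the same shape, one entry per blog owner I trust
function classifyCommenter(id, own, blogroll) {
  // My own opinion comes first.
  if (own.distrusted.has(id)) return "blocked";
  if (own.trusted.has(id)) return "trusted";
  // Then fall back to the lists published by my blogroll.
  if (blogroll.some((list) => list.distrusted.has(id))) return "blocked";
  if (blogroll.some((list) => list.trusted.has(id))) return "trusted";
  // Nobody knows this ID: keep filtering or moderate manually.
  return "unknown";
}

const mine = {
  trusted: new Set(["alice"]),
  distrusted: new Set(["spammer1"]),
};
const friends = [
  { trusted: new Set(["bob"]), distrusted: new Set(["spammer2"]) },
];

console.log(classifyCommenter("bob", mine, friends));      // trusted
console.log(classifyCommenter("spammer2", mine, friends)); // blocked
console.log(classifyCommenter("carol", mine, friends));    // unknown
```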

Update: Get the latest MT-blacklist version, which fixes a flaw regarding escaped characters.

Posted by Julien at 06:39 PM | Comments (1)

May 25, 2004

Origami folding robot

This origami folding robot from Carnegie Mellon (via Roland Piquepaille) has generated a lot of noise recently. Although I find it pretty cool, it seems to have big limitations: it can't even perform all of the most basic moves from this origami classification (Wikipedia). For example, it can't unfold or flip the paper over, let alone perform more complex techniques like the reverse fold (used for simple models, like the pajarita).

Links:
Ned points out Joseph Wu's great origami site.

Posted by Julien at 02:43 PM | Comments (0)

May 20, 2004

Despairing for IIS error logs

This has been bugging me for many months, essentially every time I need to troubleshoot a deployment issue related to IIS.
Where are the damn error logs!?!?

It just drives me nuts that IIS6.0 returns a 404 when an ISAPI extension wasn't correctly enabled (new IIS6.0 security feature) without any information to direct you to the actual source of the problem. There is no event log either.
Same thing with ATL when an unknown handler tag is found or there is a typo that breaks the syntax: you need to recompile with some extra debug flags to get more info than the default "500".

Apache comparison:
Apache provides a very useful and descriptive error log. Below is a simplified sample of what Apache traces.

[Thu May 20 09:42:05 2004] [error] [client x.x.x.x] File does not exist: /home/y
[Thu May 20 09:42:05 2004] [error] (2)No such file or directory: Incorrect permissions on webroot "/home/y" and webroot's _vti_pvt directory in FrontPageAlias().
[Mon May 17 23:56:10 2004] [error] [client x.x.x.x] Directory index forbidden by rule: /home/y

Errors in dynamic content (CGI, perl, ...) usually show up in there as well, by default, which makes it so much easier to troubleshoot, while staying secure.

These logs can be monitored interactively with utilities like tail with the "-f" option.

Posted by Julien at 04:06 PM | Comments (10)

May 19, 2004

P2P Blockbuster

Here's a random idea I had over the week-end: some software/website to support local swap/trade of DVDs, as a new form of renting.

Most of the DVDs you own sit in your living room unused for a long time. Why not use them to get to see more movies? Within a neighborhood or a company, you could put people in touch to allow them to trade discs for a certain duration.

This is already something you do with your friends, by informally lending them DVDs that you know they want to see. The idea is to extend your circle of "friends".

To begin, you would need to list the DVDs you would like to share and the ones you are looking for. When a match is found, the trade can be set up. To exchange with a neighbor, you could use mailboxes (both yours and his). At work, you could also use interoffice mail (if company policy allows it).
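The matching step could be sketched like this. It is only an illustration: the `owns`/`wants` field names, the `findSwaps` function and the sample users are all made up, and titles are compared as exact strings.

```javascript
// Find every pair of users where each one owns something the other wants.
function findSwaps(users) {
  const swaps = [];
  for (let i = 0; i < users.length; i++) {
    for (let j = i + 1; j < users.length; j++) {
      const a = users[i];
      const b = users[j];
      const aGets = b.owns.filter((title) => a.wants.includes(title));
      const bGets = a.owns.filter((title) => b.wants.includes(title));
      // Only symmetric matches count: both sides must get something.
      if (aGets.length > 0 && bGets.length > 0) {
        swaps.push({ a: a.name, b: b.name, aGets, bGets });
      }
    }
  }
  return swaps;
}

const sampleUsers = [
  { name: "Ann", owns: ["Alien"], wants: ["Brazil"] },
  { name: "Bob", owns: ["Brazil"], wants: ["Alien"] },
  { name: "Cy", owns: ["Casablanca"], wants: ["Alien"] },
];

console.log(findSwaps(sampleUsers));
// One match: Ann gets "Brazil" from Bob, Bob gets "Alien" from Ann.
```

Cy wants something Ann owns but has nothing Ann is looking for, so no trade is proposed for him.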


Open issues and pain points:
Manual input for DVD lists (owned and requested):
This is definitely a bottleneck that needs to be worked on. On the other hand, you don't need to input all the DVDs you own for this system to become useful, especially as more users join. Also, you don't need to completely rely on the automatic request matching: you can instead use manual querying or navigation of the available discs (owned by users who are interested in something you have).

Using Amazon's or IMDB's search engines to find the exact DVD based on some keywords could improve this process. Integration with DVD player or catalog software could also help.

Distribution/transport:
How to make it easy to physically swap the discs?
Mail (with Netflix-style envelopes) should not be used, because I would think trust fades with distance.
So the easiest approach is to restrict your search/swapping to people who are in your building, campus or neighborhood, depending on your preferences. Also, within a company you could use interoffice mail.

Risks:
If users think the risks (damage, late return, loss, ...) are too high, they won't participate. The main mitigation for this is to have the transactions be symmetric: you borrow from the person you are lending to. Also you know the name of the person you are swapping with.
The software should also support filing complaints, so that you can assess the reputation of a swapper.
Maybe some kind of security deposit could also be used.


More thoughts:
There is a need for cheap content swapping. This approach offers the advantage of being legal, as far as I can tell.
The same principle could be used for books.
I was thinking about CDs as well, but I don't think it works as well (mostly because their usage pattern is probably different than that of DVDs or books).

Something like this should bootstrap nicely, because it may be useful even when there are only a few people and the network effect makes it increasingly interesting to join.

In terms of features, notifications (RSS or email) could be sent when a match is found (A has something B wants and vice-versa) or to remind you of a late return.
Support for queues could be included, to handle concurrent requests for the same item.
The system could also be enriched with ratings of the items and reputation of the swappers.
More options could be offered like choosing for how long you want to swap (I think 2 weeks would be a good default) or improving the search with better geo-location or classifications.


Links:
Thanks to Larry and James, I discovered MediaChest and PeerFlix. Here are some other bloggers' takes on PeerFlix: Todd and Martin.

Update:
Just found out about the Distributed Library Project, which supports books, video and music. It is deployed in the San Francisco Bay area.
The idea was also applied to other items, like handbags ;-)

Posted by Julien at 10:02 AM | Comments (13)

May 12, 2004

PDML PHP tricks

Portable Document Markup Language (PDML) is a library to output PDF using PHP and an HTML-like markup language.
The "getting started" page describes how to use it: install the library, create a PHP file starting with <?php include "pdml.php" ?> and start outputting PDML.
The PDML will be automatically converted to a PDF file using the FPDF class.

I was curious how this works, as it seemed pretty different from previous similar libraries (PDFLib, FPDF), which offered a programmatic API to generate PDF DOMs instead of a markup language.

Looking at the very end of pdml.php you can notice a call to ob_start("ob_pdml"), with ob_pdml being defined above as function ob_pdml($buffer).
That turns out to be the key. Check out the documentation for the ob_start PHP function.

ob_start will turn output buffering on and will give the callback function a chance to process the buffered output before it is sent. That's the trick for this seamless custom markup conversion using PHP ;-)
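To make the idea concrete, here is the same pattern sketched in JavaScript rather than PHP: collect everything a page writes, then pass the whole buffer through a callback. This is an analogy, not PDML's code; `withOutputBuffer` and `toyConverter` are invented names, and the converter is a trivial stand-in for ob_pdml.

```javascript
// Run `render`, collecting everything it writes, then hand the complete
// buffer to `callback` and return whatever the callback produces.
function withOutputBuffer(callback, render) {
  const chunks = [];
  const write = (text) => {
    chunks.push(String(text));
  };
  render(write);
  return callback(chunks.join(""));
}

// Toy converter: replaces <b>...</b> with upper-case text.
// A real one would turn the markup into PDF bytes instead.
function toyConverter(buffer) {
  return buffer.replace(/<b>(.*?)<\/b>/g, (_, inner) => inner.toUpperCase());
}

const page = withOutputBuffer(toyConverter, (write) => {
  write("Hello ");
  write("<b>world</b>");
});

console.log(page); // Hello WORLD
```

Because the callback only sees the output once it is complete, the page itself stays plain markup with no conversion calls in it.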

I wonder if Flash could be similarly outputted from PHP using a markup language (that would wrap the Ming library).
There could also be an Excel markup language, based on PHP Spreadsheet_WriteExcel, or an image markup language to generate PNG and JPEG images using the GD graphics library.

Update: Simon and some commenters on his site came up with other uses of this custom markup language technique: running XSLT or Html Tidy on this second layer of processing.
Harry Fuecks mentioned a very cool technique to have Apache automatically prepend the file with the extra PHP tag (using php_value auto_prepend_file myPhpInclude.php in the .htaccess).

I wonder if something like ob_start is available in ASP.Net...

Update:
ASP.Net does support a similar "second pass processing" technique. The "Intercept and modify the output created from a ASP.Net page" article describes how to do that. This relies on setting an output Filter on the HttpResponse class.
Using inline aspx, it should be possible to have something very similar to PDML. I guess it would even be possible to write server-side controls to output blocks.

Update:
Gzip your CSS using the ob_start PHP trick (and the auto_prepend_file configuration in Apache).

Posted by Julien at 11:10 AM | Comments (6)

May 05, 2004

PageRank and new content

Finding interesting and original content is time consuming. As pointed out by recent studies, the most popular blogs aren't necessarily the ones with the most original content.

And I agree that this is what makes some blogs useful: they act as filters and give you pointers to articles you wouldn't have found. The blogosphere has become a giant neural network that filters/ranks pages that are new or not yet well-known and reconfigures itself by evolving blogrolls.
Although this works, it shows the limitations of PageRank and other machine-based relevancy engines, at least those that run over the whole internet, when it comes to new content.


Sometimes, what you want is to find content that is relevant to your interests, but not part of your daily feeds and not yet popular.
I've been trying Feedster "search feeds" lately: when you perform a Feedster search, you can subscribe to the results using a feed to receive new search results.

So far I've found this a pretty good solution, as it finds new content on sites that you don't regularly read and before it gets ranked by Google or the blogosphere. The signal to noise ratio is not great, but not that much worse than on some other human-generated feeds.

The main downside is that if you subscribe to multiple searches on related terms, you'll receive a single item via multiple feeds. But this should be rather easy to solve at the aggregator level, for example by using the permalinks as identifiers.
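A minimal sketch of that aggregator-level fix, assuming each item carries a `permalink` field (the item shape, the function name and the example URLs are made up):

```javascript
// Merge several feeds, keeping only the first item seen for each permalink.
function mergeFeeds(feeds) {
  const seen = new Set();
  const merged = [];
  for (const items of feeds) {
    for (const item of items) {
      if (seen.has(item.permalink)) continue; // already delivered by another feed
      seen.add(item.permalink);
      merged.push(item);
    }
  }
  return merged;
}

const feedA = [
  { title: "Post one", permalink: "http://example.com/1" },
  { title: "Post two", permalink: "http://example.com/2" },
];
const feedB = [
  { title: "Post two", permalink: "http://example.com/2" },
  { title: "Post three", permalink: "http://example.com/3" },
];

console.log(mergeFeeds([feedA, feedB]).map((item) => item.title));
// [ 'Post one', 'Post two', 'Post three' ]
```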

Posted by Julien at 04:48 PM | Comments (1)

May 04, 2004

Remote Javascript Scripting

Remote scripting is a way to interact with the server without having to refresh the complete page to fetch new data.

For example, RSLite is a simple javascript object that lets you call the server with a certain string and get a response string back. The request is done using an image and the return value is passed using a cookie which is polled.


Remote javascript include:
Useful javascript "libraries" usually come in a separate .js file that can be included into the html using <script type="text/javascript" language="JavaScript" src="foo.js"></script>. This is generally used to factor the code out of the html or a bookmarklet.

The problem with bookmarklets is that IE 6 restricts bookmark urls to 500 characters.
But if all your bookmarklet does is dynamically insert a <script src="http://remoteserver.com/foo.js"> tag in the page, you can avoid this limitation by having most of the code in the remote .js file.
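Here is a sketch of how such a loader bookmarklet could be generated. The helper name and the length constant are mine; the 500 comes from the limit mentioned above and the URL is the placeholder used in this post.

```javascript
// Limit on bookmark URLs quoted above for IE 6.
const MAX_BOOKMARKLET_LENGTH = 500;

// Build a "javascript:" URL whose only job is to append a <script> tag
// pointing at the remote file; all the real code lives in that file.
function makeLoaderBookmarklet(scriptUrl) {
  const code =
    "(function(){var s=document.createElement('script');" +
    "s.src=" + JSON.stringify(scriptUrl) + ";" +
    "document.getElementsByTagName('head')[0].appendChild(s);})();";
  return "javascript:" + encodeURIComponent(code);
}

const loader = makeLoaderBookmarklet("http://remoteserver.com/foo.js");
console.log(loader);
console.log(loader.length < MAX_BOOKMARKLET_LENGTH); // true
```

The bookmarklet stays the same size no matter how large foo.js grows, which is the whole point of the trick.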

Here are some examples of use of this technique: Jon Udell's quote bookmarklet, Simon Willison's page weight (page weight bookmarklet) and some simple text processing bookmarklets.


Remote scripting via script include:
Remote script includes can be used as another way to interact with the server to achieve remote scripting.
For example, the dChat online chat and my port of dChat to C# will add new script tags dynamically to the DOM to call server-side functions (with a certain calling convention). The server returns javascript that ensures the DOM is cleaned after the call succeeds and can pass some additional return values in the form of client-side javascript calls.


I've also seen this used in a weird way before: you can have the server output the response javascript slowly while keeping the connection open. mod_pubsub uses this trick to deliver updates to various types of clients (flash, javascript,...).

Related:
RSLite's author, Brent Ashley, also wrote the JSRS (Javascript Remote Scripting) library.

Mozilla bug 18843 (dynamically added script not executable).

An explanation of server push via multipart mime-type content.
A server push framework in Java, based on javascript and maintaining the connection open to a servlet (pushlet).

OpenThought framework (javascript and iframes).

A project using client-side JavaScript to remotely invoke methods in ASP.NET pages.

Posted by Julien at 03:08 PM | Comments (3)

May 03, 2004

C# port of dChat

I ported dChat, an IRC-like web chat, to C#. Try the online demo of the original dChat.

My goal was to make it really easy to use on windows/IIS in a windows domain. So, no compilation is needed, as you can run it by just deploying the source files to an IIS webserver and configuring the "Directory Security" to only use "Integrated Windows authentication".
You can then load the url for chat.aspx in a browser and start using it.

A useful configuration is to add chat.aspx to the "Default Document" list, so that if you load the url for the directory instead, it'll default to dnChat.

One of the differences with dChat is that there is no DB or persistent storage (at least not yet).
Another difference is the use of windows authentication.

Get the zipped source files for dnChat.

The files:

  • chat.aspx: a mostly static html file that includes javascript remoting tricks to communicate with f.ashx.
  • f.ashx: a C# HttpHandler that handles the remote function calls from the browser.
  • online.js: some utility javascript functions. Some of the javascript from chat.aspx could be moved in there too...

Note: The .Net framework needs to be installed on the machine in order to run C#. No need for the SDK, the redistributable is enough. Also IIS needs to be configured to use it (which should automatically get set up when installing .Net), but in case it is not, you can run "aspnet_regiis -i" to fix it.

The poll rate is currently set to 3 seconds, but it should be easy to make it adaptive: the server could set it to 10 or 15 seconds if you haven't received a message for a long time, but when the channel is active it could go back down to 3 seconds.
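A possible shape for that adaptive logic, as a sketch only. The 3, 10 and 15 second delays come from the paragraph above; the thresholds for "a long time" (one and five minutes) and all the names are assumptions, and this is not what dnChat currently does.

```javascript
const ACTIVE_DELAY_MS = 3000;        // current fixed poll rate
const QUIET_DELAY_MS = 10000;        // slower rate for a quiet channel
const IDLE_DELAY_MS = 15000;         // slowest rate for an idle channel
const QUIET_AFTER_MS = 60 * 1000;    // assumed threshold: 1 minute of silence
const IDLE_AFTER_MS = 5 * 60 * 1000; // assumed threshold: 5 minutes of silence

// Pick the delay before the next poll from the time since the last message.
function nextPollDelay(msSinceLastMessage) {
  if (msSinceLastMessage >= IDLE_AFTER_MS) return IDLE_DELAY_MS;
  if (msSinceLastMessage >= QUIET_AFTER_MS) return QUIET_DELAY_MS;
  return ACTIVE_DELAY_MS;
}

console.log(nextPollDelay(2000));           // 3000
console.log(nextPollDelay(2 * 60 * 1000));  // 10000
console.log(nextPollDelay(10 * 60 * 1000)); // 15000
```

The same function works whether the server computes the delay and returns it with each response or the client computes it locally; as soon as a message arrives, the elapsed time resets and polling drops back to 3 seconds.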

Update: Since a couple of people showed interest in extending dnChat, I started a dnChat Sourceforge project with Weston Weems.
Also, Mesalem posted a variation on dnChat at the Code Project, check out "Simple Chat Application in ASP.NET".

Update: ASP.Net 2.0 comes with a remote scripting feature: client callbacks (via).


Update (2005/05/11): Fixed an issue with unicode characters not being transmitted properly. Now using encodeURIComponent instead of escape for unicode characters on the querystring. Updated the zip file above and on Sourceforge.
Thanks to Vitor Cardoso for pointing the issue out and troubleshooting with me.
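For the curious, here is the difference between the two functions on a string with non-ASCII characters (the sample message is just an example):

```javascript
const message = "café ☕";

// escape() uses %XX for Latin-1 characters and the non-standard %uXXXX form
// for everything else, which servers generally don't decode as UTF-8.
console.log(escape(message)); // caf%E9%20%u2615

// encodeURIComponent() percent-encodes the UTF-8 bytes of each character.
console.log(encodeURIComponent(message)); // caf%C3%A9%20%E2%98%95

// The UTF-8 form round-trips cleanly.
console.log(decodeURIComponent(encodeURIComponent(message)) === message); // true
```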

Posted by Julien at 12:51 PM | Comments (15)