[Framework-Team] Plip : indexing files

Thierry Benita tbenita at atreal.net
Fri Mar 16 19:49:12 UTC 2007


Hi Martin,

We are completing a new version of ARFilePreview, based entirely on a
Five approach. It is available in the collective trunk; older versions
are in branches.
It provides a preview of a normal file object by adapting it. It also
indexes the file's content (some work is still in progress to install
it properly).

This new version is nearly finished, and it already gives a good
overview of what we want to do and how we intend to achieve it. We now
have issues with the ATCT File content type, which does not handle many
kinds of events.

We have a regression compared to the last version: ATFile handles events
badly after a PUT, no rules (use cases?) are defined for when a file is
renamed via WebDAV, etc. All of this work is done in the AT version of
ARFilePreview.

Should we work on ATFile? Would someone join us in this task?

Best regards,

Thierry.

Martin Aspeli wrote:
> Hi Thierry,
>
> I think this sounds quite interesting. Certainly, a better "document"
> story (which includes full-text indexing and a strategy to avoid ZODB
> bloat, e.g. blobfile) is pretty high on my wishlist for 3.5 (and
> limi's as well, fwiw).
>
> I would like to see a proposal that is somewhat less AT centric,
> though. It may be wishful to think that we can achieve this, but
> ideally we'd decouple portal_transform entirely, replacing it with a
> lighter framework based on Zope 3 adapters and utilities (a transform
> is a utility, adapters take care of the actual extraction of data to
> transform and consumption of the transformed text). This should also
> allow some async option (register a consumer for the transform that is
> called when the transform is complete).
>
> At this point, we could extend ATFile relatively easily to use this. I
> don't think we'd want a new content type, but rather to extend ATFile
> as necessary.
>
> I think BLOB storage and transform should be two separate proposals
> and two separate implementations.
>
> Martin
>
> On 1/29/07, tbenita at atreal.net <tbenita at atreal.net> wrote:
>> Hi,
>>
>> I'd like to make a proposal that extends PLIP #177
>> http://plone.org/products/plone/roadmap/177
>>
>> We developed a Plone component that stores a file together with its
>> HTML preview: ATFilePreview.
>>
>> It does the following:
>>
>> - make the file available for download
>>
>> - create an HTML preview of the file
>>
>> - index the file's content in full text
>>
>>
>> It has the following advantages:
>>
>> - it uses the mimetypes registry to detect mimetypes
>>
>> - it uses portal transforms to create the preview, and uses this
>> preview to extract the text that has to be indexed
>>
>> - it stores both the HTML preview and all subobjects inside the object,
>> as persistent subobjects
>>
>> - it's totally generic: it obviously previews and indexes OpenDocument
>> files, MS Office documents, PDF, RTF, HTML, Python, etc. It may also
>> show a preview for zip files, video files, audio files or whatever you
>> can imagine. Take the example of a video file: you may decide that
>> every uploaded video is transcoded to MKV format and streamed in the
>> page via a Java applet that displays the video. You only need a
>> video_to_html transform that achieves this. The result is stored
>> together with the original file, and the HTML preview is displayed.
>>
>> - the trunk (it's in collective) stores everything inside the object in
>> the ZODB, so it has no dependency and can take the place of normal file
>> objects
>>
>> - there is another version that stores the file, HTML and subobjects on
>> the filesystem. It currently uses FSS, but we'd like to move that to
>> BlobFile as FSS is a bit too complex for our use case.
>>
>> - we don't need all the TING mechanics to get full-text indexing: we
>> only need the UnicodeLexicon, as long as portal transforms send
>> unicode results (tested in France; you can imagine ;-) )
>>
>> - we already have the transforms for all office files in
>> AROfficesTransform, for which we are currently doing the integration
>> into archetypes.
>>
>>
>>
>> At this time there are two new things to consider:
>>
>> - portal transforms may overload the zope server
>>
>> - there may be decorators that should be applied to files in order to
>> handle properly specific extra fields (especially for multimedia files :
>> metadata etc.)
>>
>> * Concerning the overload of the Zope server: I think that we should
>> have an asynchronous portal transform that may run as a separate
>> Twisted daemon. It may live alongside portal_transforms and may be
>> called asynchronous_portal_transform (APT). The only difference from
>> portal_transforms is that we need to give APT a callback method so
>> that it can send back the result of the transform after a while.
>> Therefore, if a content type is APT-aware and APT is activated, APT is
>> used instead of portal_transforms. This allows the load to be moved to
>> one or more dedicated servers, for example. We may also take a look at
>> BlueDCS (I just heard of it but never tried it).
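
To make the callback idea concrete, here is a rough sketch that uses a
background thread and a queue in place of the separate Twisted daemon.
The class AsyncTransformWorker and the function convert are invented
for this illustration; no APT implementation exists yet, and a real one
would send jobs to another process or server instead of a local thread.

```python
import queue
import threading


class AsyncTransformWorker:
    """Hypothetical stand-in for APT: runs transforms off the caller's thread."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, transform, data, callback):
        """Queue a job; callback receives the result once it is ready."""
        self._jobs.put((transform, data, callback))

    def _run(self):
        while True:
            transform, data, callback = self._jobs.get()
            try:
                callback(transform(data))
            finally:
                # Mark the job finished even if the transform failed.
                self._jobs.task_done()

    def wait(self):
        """Block until every queued job has been processed."""
        self._jobs.join()


def convert(content, transform, callback, apt=None):
    """Use APT when it is available, else run the transform inline."""
    if apt is not None:
        apt.submit(transform, content, callback)
    else:
        callback(transform(content))


worker = AsyncTransformWorker()
previews = []
convert("some text", str.upper, previews.append, apt=worker)
worker.wait()
```

The content type only ever sees convert and its callback, so switching
APT on or off does not change the calling code.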
>>
>> * Concerning the decorators: there should be a kind of
>> decorators_registry that would allow decorators to be added based on
>> mimetypes.
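
One possible shape for such a registry, as a sketch only. The class
name DecoratorsRegistry, the use of glob patterns for mimetypes, and
the idea that a decorator is a callable returning extra fields are all
assumptions made for this example, not an existing API.

```python
from fnmatch import fnmatchcase


class DecoratorsRegistry:
    """Hypothetical registry mapping mimetype patterns to decorators."""

    def __init__(self):
        self._entries = []  # list of (mimetype pattern, decorator)

    def register(self, pattern, decorator):
        """Register a decorator for a mimetype pattern such as 'video/*'."""
        self._entries.append((pattern, decorator))

    def lookup(self, mimetype):
        """Return the decorators whose pattern matches, in registration order."""
        return [d for p, d in self._entries if fnmatchcase(mimetype, p)]


def video_fields(fields):
    """Example decorator: add the extra fields a video file would need."""
    return dict(fields, duration=None, codec=None)


registry = DecoratorsRegistry()
registry.register("video/*", video_fields)

fields = {"title": "demo"}
for decorate in registry.lookup("video/x-matroska"):
    fields = decorate(fields)
```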
>>
>> What do you think of all these points?
>>
>> Best regards,
>>
>> Thierry.
>>
>> -- 
>> atReal
>> http://www.atreal.net
>>
