[Setup] Re: Zope/Plone scalability

Sun May 11 10:40:09 UTC 2008

This post is generously snipped to avoid more agitation.

> As I asked before, is there some, more recent document, which
> highlights how to structure Plone for a large scale deployment? If
> there is, I'd sure like to read it ...

If you have thousands of users or tens of gigabytes of data, then it
normally pays to bring in some expert help to review your architecture.
That goes for any platform, Plone included. The basics are fairly well
known: You have a ZEO server, you have multiple ZEO clients (at least
one per processor core), you have a load balancer in front of those
clients (e.g. Pound), and you have Varnish or Squid for caching, with
CacheFu correctly configured in Plone.

> Our main site keels over frequently when it was on our Zope/Zeo
> cluster. On its own box, it only restarts itself 1-5 times per day.

This is still unacceptable - I'd never run a site that behaved like 
that. I don't think many people would. Most Plone sites don't do this, 
so there must be something in your setup that could be improved or 
fixed. I can't be more specific than that, though.

> What do you mean by "instance"? Install Zope (n front ends plus ZEO
> backend). Add Plone site. Repeat x 20.
> 
> That's the setup I'm talking about. As far as I can tell, it's the
> most natural way to add different websites (e.g. Plone sites with
> different URLs). It also turns out not to be very performant. (Note:
> there's only one ZEO box, with one data.fs, which for us is <2GB)

I don't think that "a priori" having 20+ small Plone sites in one Zope 
instance is a drag on performance, except that you're duplicating some 
things (like the catalog) that may add a bit of overhead compared to 
having one site that's 20 times as big. I suspect the performance and 
stability issues you're seeing have more to do with what's going on 
inside one or more of those sites.

This is where you need to learn how to debug things. Your logs will tell 
you when something crashes. From your previous traceback, it looks like 
something in RedirectionTool (which, by the way, is not a core part of 
Plone). Have you tried to uninstall this tool temporarily (take a 
backup!) to see whether the problem goes away? Have you tried to ask on 
the mailing lists what the problem is? Have you tried to get someone 
with Python skills to do some debugging on the line that's causing the 
error? Have you tried to understand what triggers the error - is it 
happening on any 404 page, for example? Or on all pages? Or on pages 
when a redirect alias is being invoked?
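
If nobody has done this kind of triage yet, here is a rough sketch in
Python of a first step: group the tracebacks in the event log by their
innermost frame, so you can see whether one line really accounts for
nearly all of them. The frame formats in the regular expression are an
assumption about what your log contains, and the function name is made
up for this example.

```python
import re
from collections import Counter

# Frame lines in the two styles this sketch assumes:
#   Zope:   "  Module Products.Foo.bar, line 12, in baz"
#   Python: '  File "/path/foo.py", line 12, in baz'
FRAME = re.compile(r'^\s+(?:Module|File)\s+"?([^",]+)"?,\s+line\s+(\d+)')


def count_innermost_frames(lines):
    """Count tracebacks by their last frame, as (module_or_file, line)."""
    counts = Counter()
    last_frame = None
    in_traceback = False
    for line in lines:
        if line.startswith("Traceback"):
            # A new traceback begins; forget frames from the previous one.
            in_traceback = True
            last_frame = None
            continue
        if not in_traceback:
            continue
        match = FRAME.match(line)
        if match:
            # Keep overwriting: the last frame seen is the innermost one.
            last_frame = (match.group(1), int(match.group(2)))
        elif line.strip() and line[:1] not in (" ", "\t"):
            # First non-indented line is the exception line; it ends the
            # traceback.
            if last_frame is not None:
                counts[last_frame] += 1
            in_traceback = False
    return counts
```

To use it, pass an open log file (the path depends on your instance),
e.g. count_innermost_frames(open("event.log")).most_common(5). If one
(module, line) pair dominates, that is the line to start debugging.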

> Install IIS/Apache. Add sites x 20, with host header redirection. No
> problems. Add dynamic elements (.NET, modPHP). Still no problems.

That's not a valid comparison. Add 20 advanced content management 
systems written in .NET or modPHP, fill them with the same content, and 
then come talk to me.

This kind of talk is fairly pointless, though. You're having a whinge. 
If you want to have a whinge, go ahead, but then I'll stop wasting my 
time trying to figure out what your problem is and give you advice. If 
you want to get advice, then you're much more likely to get it if you 
adjust your tone to seem less combative.

> (Again, thanks to Raphael Ritz for the ZEO Raid suggestion, which I'm
> trying out on a separate box, though the Subversion tags make my
> SysAdmins hesitant of its use on our production servers.)

From what I understand, ZEO Raid is not yet completely finished. I'd
speak to Christian Theune about it. I know he's very close to having it 
finished, but is looking for sponsorship to get it over the final hurdle.

RelStorage will let you store things in Oracle or Postgres and thus use 
their scalability features. It may be a more mature option. I know Jarn 
are using it currently.

However, as you've been told repeatedly, it's very unlikely, based on
what you've told us, that your problems lie at the ZEO server, and thus
that ZEO Raid or RelStorage would help. It's extremely likely that the
tracebacks you are seeing every second *are* the symptom of the problem,
and those are *not* caused by ZEO server issues. If you had ZEO server
issues, you'd be seeing different messages (related to the ability of
the ZEO client to talk to the ZEO server).
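
One rough way to check this for yourself is to tally the error-level
log entries by the top-level name of the logger that wrote them. The
Python sketch below assumes each entry starts with a line of the form
"<timestamp> <LEVEL> <logger name> <message>" and that ZEO client
messages come from loggers whose names start with "ZEO"; both are
assumptions to verify against your own logs, and the function name is
made up for this example.

```python
from collections import Counter


def count_errors_by_logger(lines, levels=("ERROR", "CRITICAL")):
    """Tally error-level entries by the first component of the logger name."""
    counts = Counter()
    for line in lines:
        # Assumed layout: timestamp, level, logger name, then the message.
        parts = line.split(None, 3)
        if len(parts) >= 3 and parts[1] in levels:
            top_level = parts[2].split(".", 1)[0]
            counts[top_level] += 1
    return counts
```

If nearly everything is counted under application loggers and little or
nothing under "ZEO", that supports looking at the application code
rather than at the storage server.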

> And, there doesn't seem to be the experience with large deployments.
> Even if we were to add consulting services to our sites, we're not
> certain we'd get a viable deployment that could handle hundreds of
> campus department web sites and hundreds of thousands of pages.

Lots of people run sites that are much bigger than what you've described.

> If you, or Martin, or someone else, based on a stack trace we were
> getting every 200 milliseconds could say -- "Oh, that looks like
> this, you could probably do that" -- I'd have a case to consider if
> hiring to address that problem would be a worthwhile investment.

I did that when you first posted it. The problem is in RedirectionTool. 
It tells you which line. Any capable developer will be able to at least 
do some debugging starting there. See also my suggestions above.

Martin

-- 
Author of `Professional Plone Development`, a book for developers who
want to work with Plone. See http://martinaspeli.net/plone-book