[Setup] Fwd: Zope loses data irretrievably upon restart

Tom Cameron tom at mooball.net
Wed May 27 04:03:07 UTC 2015


Ive sent this to the zope list too but figured many people use Plone so I
may get some feedback here too.

We’ve been running Zope servers for about 15 years. In the last few years
we have had a number of Zope servers instantly lose data upon restart. This
has happened to us about 5 times over the past 2 years on 4 different
servers.

Upon restarting Zope we notice a rollback of data to sometimes months old
and a Data.fs with a much older timestamp than expected. At this point the
data was irretrievable. We find that it is not in any backups or anywhere
on the file system.

We have finally worked out what is happening but not exactly why. Below is
a detailed explanation of what we are seeing.

We are keen to know if anybody else is having similar issues or if perhaps
we have setup something incorrectly and need to change how we pack the
database. Also this may serve as a warning to others.

*Explanation*
The issue is that Zope is losing the file descriptor, the Data.fs still
remains in the folder but Zope keeps writing to a (deleted) file descriptor
which can be seen with lsof. Every server affected by this was running ZEO
and had zeopack in the default configuration. We believe for some reason
when zeopack tries to pack the ZODB, that when the Data.fs is moved and
renamed to Data.fs.old and eventually deleted, Zope doesn’t recognize the
new file.

This is what we believe happens:

1. Zope is connected to a file Data.fs.
2. zeopack packs this file into a new file Data.fs.tmp
3. zeopack deletes any file named Data.fs.old
4. zeopack then renames Data.fs to Data.fs.old
5. zeopack then renames Data.fs.tmp to Data.fs

The problem is that Zope is connected directly to the Data.fs file via its
handler and even though it is renamed it continues to write to Data.fs.old.

Later on, we do a subsequent zeopack and the Data.fs.old file is deleted in
the process, but Zope keeps writing to this deleted file.

We now have a situation where Zope is connected to a deleted file and is
writing all its data to it. If Zope is ever stopped, the connection is lost
and so effectively is the file.

It is true that expert forensics could retrieve the file but in the first
few instances that this happened we were totally baffled as to where the
data had gone. In one case we had not restarted Zope for 4 months and thus
we lost 4 months of data.

*Environment*
We are running standard Plone servers on Debian 5.0.10 in this case, with
following:

   - Plone 4.3.4.1 (4309)
   - CMF 2.2.8
   - Zope 2.13.22
   - Python 2.7.8 (default, Jan 27 2015, 14:19:28) [GCC 4.3.2]
   - PIL 2.3.0 (Pillow)
   - ZEO: plone.recipe.zeoserver 1.2.6

Effective-user is set correctly for this instance “landcarevic” and
properly mounts the Data.fs under normal circumstances.

*Detailed Symptoms*
This is a real case we discovered and managed to retrieve the lost file.

Zope will continue to write to a Data.fs that has been technically deleted.
We are able to notice this because the Data.fs.tmp will have a much more
recent timestamp than the Data.fs in the var/filestorage directory.

-rw-r--r-- 1 landcarevic psacln  6735952811 2015-04-12 00:11 Data.fs

-rw-r--r-- 1 landcarevic psacln    18594749 2015-04-12 00:11 Data.fs.index

-rw-r--r-- 1 landcarevic psacln           6 2015-03-30 11:20 Data.fs.lock

-rw-r--r-- 1 landcarevic psaserv   14600344 2015-05-26 10:27 Data.fs.tmp

An “lsof | grep Data.fs” can show if the file descriptor is pointing to a
deleted file on the fs:

python2.7  9307  landcarevic    7uW     REG               0,31          6
  4986845 /var/www/vhosts/
landcarevic.net.au/plone43/var/filestorage/Data.fs.lock

python2.7  9307  landcarevic   11u      REG               0,31   14600344
  4986235 /var/www/vhosts/
landcarevic.net.au/plone43/var/filestorage/Data.fs.tmp

python2.7  9307  landcarevic   12u      REG               0,31 7058741434
  4986349  (deleted)/var/www/vhosts/
landcarevic.net.au/plone43/var/filestorage/Data.fs.old

The process id for the instance is 9307 and it’s still holding open a
Data.fs.old that has been deleted. Most likely due to a ZEO pack before
from the cron in this case. Because it packs the Data.fs and then moves the
old one to Data.fs.old which eventually gets deleted. But Zope never
properly recognises the new Data.fs for some reason.

*Recovery*
As long as the Zope server doesn’t get shut down or restarted, it will hold
this open and continue to write to the disk. It can be recovered by copying
from /proc where /proc/{process ID}/fd/{file descriptor ID} can be found
from running lsof as previously:

# cp /proc/9307/fd/12 /home/alyssa/Data.fs-current

# ls -al /home/alyssa/

...

-rw-r--r--  1 root root 7058741434 2015-05-26 11:11 Data.fs-current

Which has the correct timestamp and file size.




-- 
*Tom Cameron*



-- 
*Tom Cameron*
Technical Director

*Mooball IT*

Level 1, 888 Brunswick Street, New Farm, Brisbane QLD 4005
mob: +61 (4) 5584 1995  ph: +61 (7) 3040 4440
http://www.mooball.com

<http://www.mooball.com/>

*Try Kitovu <https://kitovu.com/> - A better way to manage client work*

Email disclaimer <http://www.mooball.com/email_disclaimer>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.plone.org/pipermail/plone-setup/attachments/20150527/dbc4625d/attachment.html>


More information about the Setup mailing list