[Setup] Fwd: Zope loses data irretrievably upon restart

Maurits van Rees m.van.rees at zestsoftware.nl
Tue Jun 9 14:55:16 UTC 2015


I had a similar experience recently, with a totally different cause.
Someone had repartitioned the hard disk.  The Data.fs was at 
/home/somewhere/Data.fs on the root partition.  The disk then got a 
separate /home partition, but the zeoserver was not restarted.  All 
seemed to keep working: Zope kept writing to the file on the root 
partition.  After a zeoserver restart it found the file on the new 
/home partition, which only had the data from the time the 
repartitioning happened.

I am not sure if they managed to recover the data by temporarily undoing 
the /home partition: unmounting it should bring back the file on the 
root partition that had the recent changes.

Your problem seems different.  I just wanted to add an extra data point.

I have not seen your exact problem before with the zeopacking.

Come to think of it, we rotate the logs weekly and have an extra line in 
the logrotate.conf that sends a USR2 kill signal to the zeoserver so 
that it reopens the log files.  I wonder if it reopens the Data.fs too.

But if this is the problem, then I would expect that you have 
encountered this a lot more than five times in the past two years.  I 
wonder if there is a race condition in the zeopack script that triggers 
this.

Just some thoughts.

Maurits

P.S. The buildout part that generates the logrotate.conf file:

[logrotate.conf]
recipe = zc.recipe.deployment:configuration
text =
     rotate 4
     weekly
     create
     compress
     delaycompress

     ${buildout:directory}/var/log/zeoclient*.log {
         sharedscripts
         postrotate
             /bin/kill -USR2 $(cat ${buildout:directory}/var/zeoclient.pid)
         endscript
     }

     ${buildout:directory}/var/log/zeoserver.log {
         postrotate
             /bin/kill -USR2 $(cat ${buildout:directory}/var/zeoserver.pid)
         endscript
     }

Then link to this file in /etc/logrotate.d/

Tom Cameron schreef op 27-05-15 om 06:03:
> I've sent this to the Zope list too, but figured many people use Plone
> so I may get some feedback here too.
>
> We’ve been running Zope servers for about 15 years. In the last few
> years we have had a number of Zope servers instantly lose data upon
> restart. This has happened to us about 5 times over the past 2 years on
> 4 different servers.
>
> Upon restarting Zope we notice a rollback of the data, sometimes to a
> months-old state, and a Data.fs with a much older timestamp than
> expected. At that point the data was irretrievable: we found it was not
> in any backups or anywhere on the file system.
>
> We have finally worked out what is happening but not exactly why. Below
> is a detailed explanation of what we are seeing.
>
> We are keen to know if anybody else is having similar issues, or if
> perhaps we have set up something incorrectly and need to change how we
> pack the database. This may also serve as a warning to others.
>
> *Explanation*
> The issue is that Zope loses track of the file: a Data.fs still remains
> in the folder, but Zope keeps writing to a (deleted) file descriptor,
> which can be seen with lsof. Every server affected by this was running
> ZEO and had zeopack in the default configuration. We believe that when
> zeopack packs the ZODB and the old Data.fs is renamed to Data.fs.old
> and eventually deleted, Zope for some reason never picks up the new
> file.
>
> This is what we believe happens:
>
> 1. Zope is connected to a file Data.fs.
> 2. zeopack packs this file into a new file Data.fs.tmp
> 3. zeopack deletes any file named Data.fs.old
> 4. zeopack then renames Data.fs to Data.fs.old
> 5. zeopack then renames Data.fs.tmp to Data.fs
>
> The problem is that Zope stays connected to the original Data.fs file
> via its open file handle, so even though the file is renamed it
> continues to write to it as Data.fs.old.
>
> Later on, we do a subsequent zeopack and the Data.fs.old file is deleted
> in the process, but Zope keeps writing to this deleted file.
>
> We now have a situation where Zope is connected to a deleted file and is
> writing all its data to it. If Zope is ever stopped, the connection is
> lost and so effectively is the file.
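>
> This is standard POSIX behaviour: rename() and unlink() only change
> directory entries, while a descriptor that is already open keeps
> pointing at the original inode. A minimal Python sketch of the effect
> described above (all file names are made up for the demonstration):
>
> import os, tempfile
>
> d = tempfile.mkdtemp()
> path = os.path.join(d, "Data.fs")
>
> f = open(path, "w")              # stands in for Zope's open storage file
> f.write("original data\n")
> f.flush()
>
> # What the pack effectively does: write a packed copy, rotate the names.
> with open(os.path.join(d, "Data.fs.tmp"), "w") as p:
>     p.write("packed data\n")
> os.rename(path, path + ".old")                    # Data.fs -> Data.fs.old
> os.rename(os.path.join(d, "Data.fs.tmp"), path)   # Data.fs.tmp -> Data.fs
>
> # The old handle still points at the renamed file, not the new Data.fs.
> f.write("new transaction\n")
> f.flush()
> print(open(path).read())          # only "packed data"
> print(open(path + ".old").read()) # original data plus the new transaction
>
> # Deleting Data.fs.old does not stop the writes either; the inode lives
> # on until the descriptor is closed, which is why the data only vanishes
> # when the process is finally restarted.
> os.remove(path + ".old")
> f.write("still being written\n")
> f.close()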
>
> It is true that expert forensics could retrieve the file but in the
> first few instances that this happened we were totally baffled as to
> where the data had gone. In one case we had not restarted Zope for 4
> months and thus we lost 4 months of data.
>
> *Environment*
> We are running standard Plone servers, on Debian 5.0.10 in this case,
> with the following:
>
>   * Plone 4.3.4.1 (4309)
>   * CMF 2.2.8
>   * Zope 2.13.22
>   * Python 2.7.8 (default, Jan 27 2015, 14:19:28) [GCC 4.3.2]
>   * PIL 2.3.0 (Pillow)
>   * ZEO: plone.recipe.zeoserver 1.2.6
>
> The effective-user is set correctly for this instance (“landcarevic”)
> and the Data.fs is mounted properly under normal circumstances.
>
> *Detailed Symptoms*
> This is a real case we discovered and managed to retrieve the lost file.
>
> Zope will continue to write to a Data.fs that has been technically
> deleted. We are able to notice this because the Data.fs.tmp will have a
> much more recent timestamp than the Data.fs in the var/filestorage
> directory.
>
> -rw-r--r-- 1 landcarevic psacln  6735952811 2015-04-12 00:11 Data.fs
> -rw-r--r-- 1 landcarevic psacln    18594749 2015-04-12 00:11 Data.fs.index
> -rw-r--r-- 1 landcarevic psacln           6 2015-03-30 11:20 Data.fs.lock
> -rw-r--r-- 1 landcarevic psaserv   14600344 2015-05-26 10:27 Data.fs.tmp
>
>
> An “lsof | grep Data.fs” can show if the file descriptor is pointing to
> a deleted file on the fs:
>
> python2.7  9307  landcarevic    7uW  REG  0,31           6  4986845 /var/www/vhosts/landcarevic.net.au/plone43/var/filestorage/Data.fs.lock
>
> python2.7  9307  landcarevic   11u   REG  0,31    14600344  4986235 /var/www/vhosts/landcarevic.net.au/plone43/var/filestorage/Data.fs.tmp
>
> python2.7  9307  landcarevic   12u   REG  0,31  7058741434  4986349 /var/www/vhosts/landcarevic.net.au/plone43/var/filestorage/Data.fs.old (deleted)
>
>
> The process id for the instance is 9307 and it is still holding open a
> Data.fs.old that has been deleted, most likely due to an earlier ZEO
> pack from the cron in this case: the pack writes a new Data.fs and
> moves the old one to Data.fs.old, which eventually gets deleted. But
> Zope never properly recognises the new Data.fs for some reason.
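>
> One way to confirm this state without reading through lsof output is to
> compare the inode of the Data.fs on disk with the inodes behind the
> descriptors the process holds open under /proc. A rough sketch, reusing
> the pid and path from this example (both would need adjusting for the
> instance being checked):
>
> import os
>
> # pid and path taken from the example above; purely illustrative.
> pid = 9307
> datafs = ("/var/www/vhosts/landcarevic.net.au/plone43"
>           "/var/filestorage/Data.fs")
>
> disk = os.stat(datafs)
> fd_dir = "/proc/%d/fd" % pid
> healthy = False
> for fd in os.listdir(fd_dir):
>     link = os.path.join(fd_dir, fd)
>     try:
>         st = os.stat(link)   # stats the open file, even if it was deleted
>     except OSError:
>         continue             # descriptor went away while we were looking
>     if (st.st_dev, st.st_ino) == (disk.st_dev, disk.st_ino):
>         healthy = True
>         break
> print("process %d %s writing to the Data.fs on disk"
>       % (pid, "is" if healthy else "is NOT"))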
>
> *Recovery*
> As long as the Zope server doesn’t get shut down or restarted, it will
> hold the file open and continue to write to the disk. The data can be
> recovered by copying from /proc, where the {process ID} and {file
> descriptor ID} in /proc/{process ID}/fd/{file descriptor ID} come from
> the lsof output shown earlier:
>
> # cp /proc/9307/fd/12 /home/alyssa/Data.fs-current
>
> # ls -al /home/alyssa/
>
> ...
>
> -rw-r--r--  1 root root 7058741434 2015-05-26 11:11 Data.fs-current
>
>
> Which has the correct timestamp and file size.
>
>
>
>
> --
> *Tom Cameron*
> Technical Director
>
> *Mooball IT*
>
> Level 1, 888 Brunswick Street, New Farm, Brisbane QLD 4005
> mob: +61 (4) 5584 1995  ph: +61 (7) 3040 4440
> http://www.mooball.com
>
> *Try Kitovu <https://kitovu.com/> - A better way to manage client work*
>
> Email disclaimer <http://www.mooball.com/email_disclaimer>
>
>


-- 
Maurits van Rees: http://maurits.vanrees.org/
Zest Software: http://zestsoftware.nl


