Backup operation fails (PVA 4.6)

Discussion in 'Performance Questions' started by BiznetAdmin, Mar 2, 2012.

  1. BiznetAdmin

    BiznetAdmin New Member

    Messages:
    10
    I am trying to back up a container, but the backup operation always fails. I receive the following messages in the task log:

    Furthermore, before the operation fails, the backup process gets stuck at 3% for a few hours.

    I have increased (I believe) the timeout to 36000 through the Parallels Management Console. Where can I look for additional information about why the backup operation fails?

    The Container I am trying to back up has 150GB assigned to it and it is using about 70GB, if that matters.
  2. MakcyD

    MakcyD Parallels Team

    Messages:
    66
    Hi,

    Actually, there is not enough information here to determine the cause of the issue.
    1. Is the backup made locally or to remote storage?
    2. How exactly is the backup launched (a scheduled task in the PVA MN, the Parallels Management Console, or the vzabackup/vzbackup tool)?
    3. Which OS/PVC versions are used?
  3. philippe henneau

    philippe henneau New Member

    Messages:
    64
    We also have backup issues since upgrading to the latest version. We have to reset the PVA container every 2-3 days; then backups work fine again for a few days.
    We also noticed that we can no longer back up a PBSM node on a VZ node; this was supported before.

    Sys info:

    Component         Version  Release
    pva-release       4.6      1861
    pva-cp-core       4.6      464
    pva-cp-parallels  4.6      358.1
    pva-cp-plesk      4.6      106.2
    pva-cp-virtuozzo  4.6      343.1

    Regards
    Philippe
    Last edited: Mar 3, 2012
  4. BiznetAdmin

    BiznetAdmin New Member

    Messages:
    10
    1. Backup is made locally (on the Hardware Node where the Container is found)
    2. Backup is launched manually via Parallels Virtual Automation.
    3. The OS is CentOS 5.7 (64-bit).
    Parallels Virtual Automation is 4.6.
    Parallels Virtuozzo is 4.7.
    Last edited: Mar 4, 2012
  5. BiznetAdmin

    BiznetAdmin New Member

    Messages:
    10
    Can you please tell me how I can initiate the backup procedure from the CLI? I have seen the "vzbackup" command, but from what I understand it requires a Backup Node, which I do not have. I would like to make the backup locally on the Hardware Node where the Container is found.
  6. MakcyD

    MakcyD Parallels Team

    Messages:
    66
    Hi,

    The common way to create a backup from the CLI (and also to troubleshoot a PVA-based backup) is to use the vzabackup tool:
    Code:
    ~# vzabackup -F $HWNODE_HOSTNAME -e $CTID
    e.g. for me it was:
    Code:
    [22:42:31][~]# vzabackup -F localhost -e 200
    Starting backup operation for node 'pvcnode.parallels.tld'...
    * Operation with the Container myhost.tld is started
    * Backing up environment myhost.tld locally
    * Checking parameters
    * Dumping quota
    * Creating backup f8bbf37a-cdcc-8745-977f-471f8b753f97/20120311164308
    * Adjusting backup type (full)
    * Backup storage: receiving backup file
    * Preparing for backup operation
    * Backing up private area
     100% |**************************************************************|
    * Sending private backup data
    * Backup storage: storing private backup data
    * Backup storage: filling resultant backup info
    * Operation with the Container myhost.tld is finished successfully.
    Backup operation for node 'pvcnode.parallels.tld' was finished successfully.
    
    I would also suggest updating the PVA Agent on your PVC node to build 4.6-1866.
    A number of backup-related bugs were fixed in that particular build, so it might resolve the issue.
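    If you are not sure which build is currently installed on the node, you can list the installed PVA packages with rpm (just a quick check on my side, assuming the components are installed as RPM packages named like the ones in Philippe's list):
    Code:
    ~# rpm -qa | grep -i pva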


    In your particular case there could be several possible causes:
    1. The timeout is too small (generally, 10 hours should be enough, but that may depend on the second point).
    2. Not enough swap/RAM on the server (although in build 4.6-1866 the back-end backup libraries use less memory than in previous builds); see the quick check below.
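    A quick way to keep an eye on free memory and swap while the backup is running is with the standard tools (nothing PVA-specific here; vmstat prints a fresh line of memory and swap statistics every 5 seconds, stop it with Ctrl+C):
    Code:
    ~# free -m
    ~# vmstat 5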

    If the backup from the command line completes fine, then the issue is on the PVA Management Node side and it is necessary to increase the timeouts there:
    Code:
    [root@pvaint ~]# grep heavy_timeout /var/opt/pva/mn/etc/vzagent.conf
    <heavy_timeout_limit>360000</heavy_timeout_limit>
    <heavy_timeout_limit>36000</heavy_timeout_limit>
    <heavy_timeout_limit>36000</heavy_timeout_limit>
    <heavy_timeout_limit>36000</heavy_timeout_limit>
    <heavy_timeout_limit>36000</heavy_timeout_limit>
    [root@pvaint ~]#
    and restart pvamnd:
    Code:
    ~# pvamn restart
    The next step is to verify the timeouts on the PVC node:
    Code:
    [22:51:51][~]# grep heavy_timeout /vz/pva/agent/etc/vzagent.conf
    <heavy_timeout_limit>360000</heavy_timeout_limit>
    <heavy_timeout_limit>36000</heavy_timeout_limit>
    <heavy_timeout_limit>36000</heavy_timeout_limit>
    <heavy_timeout_limit>36000</heavy_timeout_limit>
    <heavy_timeout_limit>36000</heavy_timeout_limit>
    <heavy_timeout_limit>36000</heavy_timeout_limit>
    <heavy_timeout_limit>36000</heavy_timeout_limit>
    <heavy_timeout_limit>36000</heavy_timeout_limit>
    <heavy_timeout_limit>36000</heavy_timeout_limit>
    <heavy_timeout_limit>36000</heavy_timeout_limit>
    <heavy_timeout_limit>36000</heavy_timeout_limit>
    <heavy_timeout_limit>36000</heavy_timeout_limit>
    <heavy_timeout_limit>36000</heavy_timeout_limit>
    [22:52:31][~]#
    and restart pvaagent:
    Code:
    ~# pvaagent restart
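    If you need to raise those values rather than just check them, a minimal sketch (this is just an example on my side that sets every heavy_timeout_limit to 360000; -i.bak keeps a backup copy of each config file, and pvamnd/pvaagent still need to be restarted as above):
    Code:
    ~# sed -i.bak 's|<heavy_timeout_limit>[0-9]*</heavy_timeout_limit>|<heavy_timeout_limit>360000</heavy_timeout_limit>|g' /var/opt/pva/mn/etc/vzagent.conf
    ~# sed -i.bak 's|<heavy_timeout_limit>[0-9]*</heavy_timeout_limit>|<heavy_timeout_limit>360000</heavy_timeout_limit>|g' /vz/pva/agent/etc/vzagent.conf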
    If that does not help, it is necessary to find the error code given in the vzabackup output and analyse /var/log/pva/agent/YYYY.MM.DD-vzagent.log for errors.
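    To narrow the log down you can filter out the informational entries first (an assumption on my side that error and warning lines carry a level tag other than L=(inf), e.g. L=(err) or L=(wrn)):
    Code:
    ~# grep -E 'L=\((err|wrn)\)' /var/log/pva/agent/*-vzagent.log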

    If you are unable to find the root cause of the issue, please post the vzabackup output here and attach an archive of the corresponding vzagent.log so I can check the issue in more detail.
  7. BiznetAdmin

    BiznetAdmin New Member

    Messages:
    10
    Unfortunately, the backup could not be completed once again. The output of vzabackup is the following:

    I am also attaching the vzagent log you requested. I could not find what might be the cause.

    We are already using "Parallels Virtual Automation 4.6-1866".
    The server has approximately 30GB of RAM+Swap, so I believe that running out of RAM/Swap is not the case.

    Regarding the other files that you mention:
    /vz/pva/agent/etc/vzagent.conf: No such file or directory
    /var/opt/pva/mn/etc/vzagent.conf: No such file or directory
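    Is something like the following the right way to locate the config on my setup? (I am only checking the PVA directories mentioned so far rather than searching the whole disk.)
    Code:
    ~# find /vz/pva /var/opt/pva /opt/pva -name vzagent.conf 2>/dev/null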

    I would also like to mention that the backup procedure always gets stuck at 26%*. I don't know if this is something that can be used to diagnose the problem, but I thought I should mention it.

    * In the first post I mentioned 3%; that was on another container, so please do not be confused. The two containers are identical, though.

    Attached Files:

    Last edited: Mar 13, 2012
  8. figjam

    figjam New Member

    Messages:
    110
    I have exactly the same thing happening on the last two nodes I have rebuilt. One was CentOS 5.7 (I just upgraded to 5.8 on the off chance that it might change something) and the other is 5.8. Both nodes are:

    Parallels Virtuozzo Containers 4.0.0
    Parallels Virtual Automation 4.6-1509

    One is running 2.6.18-028stab099.3 and the other 2.6.18-028stab098.1.
    All nodes back up to an NFS mount.
    vzabackup is launched via a cron job on each hwnode.
    Changing the backup settings for the hwnode in PVA from Global to local makes no difference.
    Running from the command line makes no difference.
    Each node has a container that times out after ten hours.
    They hang at random points during the backup, i.e. the percentage complete is different each time.
    The containers are not that large: 22.3 GB & 12.6 GB in use.
    There was another container on one of these nodes that suddenly started backing up again.
    The container OSs are ubuntu-10.04-x86_64 and debian-5.0-x86.
    Restarting pvamn and pvaagent makes no difference.

    If I execute "pgrep -lf vzlpl" after restarting pvaagent, I will usually see one process. It will not close gracefully; I have to use "kill -9" to get rid of it.

    # pgrep -lf vzlpl
    139374 /opt/pva/agent/bin/vzlpl /vz/pva/agent/tmp.DzdIgg
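
    For reference, this is how I get rid of it (the PID comes from the pgrep output above):

    # kill -9 139374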



    Executing "lsof -p <PID>" will show you the file it has stopped on near the bottom of the output. You can also do this while the backup is running, you just need to find the right vzlpl process.

    # lsof -p 139374
    <lots of output> ...
    vzlpl 139374 root 22r REG 104,3 178411 5233452 /opt/pva/agent/bin/vzlpl.dat
    vzlpl 139374 root 23u unix 0xffff810218bbd0c0 0t0 27937416 socket
    vzlpl 139374 root 24r DIR 105,1 4096 130034256 /vz/private/100
    vzlpl 139374 root 25r DIR 105,1 4096 130561914 /vz/private/100/fs
    vzlpl 139374 root 26r DIR 105,1 4096 130561915 /vz/private/100/fs/root
    vzlpl 139374 root 27r DIR 105,1 4096 134763931 /vz/private/100/fs/root/var
    vzlpl 139374 root 28r DIR 105,1 4096 137894274 /vz/private/100/fs/root/var/lib
    vzlpl 139374 root 29r DIR 105,1 4096 137897340 /vz/private/100/fs/root/var/lib/jenkins
    vzlpl 139374 root 30r DIR 105,1 4096 138414675 /vz/private/100/fs/root/var/lib/jenkins/.rvm
    vzlpl 139374 root 31r DIR 105,1 4096 138421606 /vz/private/100/fs/root/var/lib/jenkins/.rvm/gems
    vzlpl 139374 root 32r DIR 105,1 4096 138421608 /vz/private/100/fs/root/var/lib/jenkins/.rvm/gems/ruby-1.9.2-p290
    vzlpl 139374 root 33r DIR 105,1 20480 138421609 /vz/private/100/fs/root/var/lib/jenkins/.rvm/gems/ruby-1.9.2-p290/cache
    vzlpl 139374 root 42w FIFO 0,6 0t0 27937092 pipe
    vzlpl 139374 root 43r FIFO 0,6 0t0 27937093 pipe
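
    A quick way to jump straight to the last open files of whichever vzlpl process is running (this works while the backup is still in progress too):

    # for pid in $(pgrep -f vzlpl); do lsof -p "$pid" | tail -15; done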


    Error 307 does not appear in "/var/log/pva/agent/YYYY.MM.DD-vzagent.log". The relevant entries appear to be:

    T=03:10:47:043; L=(inf); PID=121830; TID=2b90b5a93b00; P=VZLControl [terminate] Terminating 387924
    T=03:10:47:043; L=(inf); PID=121830; TID=2b90b5a93b00; P=VZLControl [terminate] Terminating 387925
    T=03:10:47:045; L=(inf); PID=121830; TID=2b90b5a93b00; P=VZLControl [terminate] Terminating 387925
    T=03:10:47:049; L=(inf); PID=121830; TID=2b90b5a93b00; P=VZLControl [close] Close transport 387924 with code 104: Pipe read error: 104 (Connection reset by peer)
    T=03:10:47:052; L=(inf); PID=121830; TID=2b90b5a93b00; P=VZLControl [disconnect] Disconnect transport 387924
    T=03:10:47:055; L=(inf); PID=123019; TID=2b90b5a93b00; P=VZLOperatorCommon [disconnectClient] proc_info: Stopping the monitor for backupm
    T=03:10:47:057; L=(inf); PID=123029; TID=2b90b5a93b00; P=VZLOperatorCommon [disconnectClient] vzaproc_info: Stopping the monitor for backupm
    T=03:10:47:070; L=(inf); PID=121830; TID=2b90b5a93b00; P=VZLControl [close] Close vzclient168-34a9f422-5f2a-7747-a1dd-6956a2d5ea54
    T=03:10:47:070; L=(inf); PID=121830; TID=2b90b5a93b00; P=VZLControl [terminate] Terminating -1
    T=03:10:47:070; L=(inf); PID=121830; TID=2b90b5a93b00; P=VZLControl [close] Close transport -1 with error: Connection closed
    T=03:10:47:070; L=(inf); PID=121830; TID=2b90b5a93b00; P=VZLControl [disconnect] Disconnect transport -1
    T=03:10:47:070; L=(inf); PID=121830; TID=2b90b5a93b00; P=VZLControl [disconnect] Disconnect vzclient168-34a9f422-5f2a-7747-a1dd-6956a2d5ea54
    T=03:10:47:070; L=(inf); PID=121830; TID=2b90b5a93b00; P=VZLControl [close] Close transport -1 with error: Connection closed
    T=03:10:47:070; L=(inf); PID=121830; TID=2b90b5a93b00; P=VZLControl [disconnect] Disconnect transport -1
    T=03:10:47:089; L=(inf); PID=121830; TID=2b90b5a93b00; P=VZLControl [disconnect] Disconnect: Name: vzclient168-34a9f422-5f2a-7747-a1dd-6956a2d5ea54
    T=03:10:47:092; L=(inf); PID=123019; TID=2b90b5a93b00; P=VZLOperatorCommon [disconnectClient] proc_info: Stopping the monitor for vzclient168-34a9f422-5f2a-7747-a1dd-6956a2d5ea54
    T=03:10:47:094; L=(inf); PID=123029; TID=2b90b5a93b00; P=VZLOperatorCommon [disconnectClient] vzaproc_info: Stopping the monitor for vzclient168-34a9f422-5f2a-7747-a1dd-6956a2d5ea54
    T=03:10:47:204; L=(inf); PID=121830; TID=2b90b5a93b00; P=VZLControl [close] Close transport 387925 with code -4: Connection closed
    T=03:10:47:207; L=(inf); PID=121830; TID=2b90b5a93b00; P=VZLControl [disconnect] Disconnect transport 387925
    T=03:10:47:209; L=(inf); PID=123019; TID=2b90b5a93b00; P=VZLOperatorCommon [disconnectClient] proc_info: Stopping the monitor for vzlin_backup_serializer
    T=03:10:47:212; L=(inf); PID=123029; TID=2b90b5a93b00; P=VZLOperatorCommon [disconnectClient] vzaproc_info: Stopping the monitor for vzlin_backup_serializer
    T=03:11:14:233; L=(udf); PID=123026; TID=2b90b5a93b00; P=VZABasicFunctionality [timeoutEvent] Process status change ctid#100 (8d3bcb5e-a065-dd42-a7fb-d3b0174bc450) old:6/15 -> new:6/0
    T=03:11:14:238; L=(inf); PID=123026; TID=2b90b5a93b00; P=VZLLibCore [handleEvent] Updating status of env 8d3bcb5e-a065-dd42-a7fb-d3b0174bc450 (cur parent 34a9f422-5f2a-7747-a1dd-6956a2d5ea54, event from 34a9f422-5f2a-7747-a1dd-6956a2d5ea54, new status is 6, transition 0)



    All the other nodes I have are running 2.6.18-028stab095.1 or earlier; maybe it's a bug in the newer kernels?

    Cheers,


    Andrew
    Last edited: Apr 11, 2012
  9. Paul Groeneweg

    Paul Groeneweg New Member

    Messages:
    6
    I am having the same problem with backups on two HNs: some VEs finish with error #307. Not all VEs are failing, and the VEs that fail are not the biggest ones.

    The two HNs both run CentOS 5.8 with different kernels (2.6.32-042stab039.11 and 2.6.32-042stab053.5).
    We have installed pva-release-4.6-1866.

    I tried to increase the timeouts and to run vzabackup from the command line, but the backup still ends before finishing.
    I tried to back up to a non-iSCSI backup drive, but that fails too.
    The different VEs stop at different percentages (10% and 24%). When I run vzabackup again, it hangs for a few minutes at the same percentage and then ends with #307.

    So I thought it might be a corrupt filesystem.
    I ran vzfsutil to correct any broken links that might cause the backup to fail, but no luck.

    I am still thinking it could be a corrupt filesystem. So is it safe to run fsck? Or would this destroy files like the CoW files?


    Further, vzbackup seems to run fine, although it gives a "Backup nodes map" warning:

    logfile in backup folder:
    2012-04-21T16:14:25+0200 vzbackup1 (xxxxx:175): Archive with tag 2012-04-21T154947+0200 @ xxxx created. Mode I.
    2012-04-21T16:14:25+0200 vzbackup1 (xxxxx:175): Cleaning up...
    2012-04-21T16:14:25+0200 vzbackup1 (xxxxx:175): Done.
    Warning: Failed to load Backup nodes map.
    2012-04-21T16:14:27+0200 vzbackup(6500): Backup node xxxxx finished OK. Containers - 175.
  10. figjam

    figjam New Member

    Messages:
    110
    Kernel Version

    BiznetAdmin & Philippe,

    Can you please tell us which kernel version you are using on the problem hardware nodes?
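
    For example, the output of:

    # uname -r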

    Cheers,


    Andrew
  11. Paul Groeneweg

    Paul Groeneweg New Member

    Messages:
    6
    I also tried to stop the VE and then run vzabackup. It got further than when the VE was running, but in the end it stayed at 100% without completing, and a DemonData file in the /tmp folder filled up my entire /tmp folder (+/- 5GB).

    So I really suspect some kind of corrupted data. I don't understand why /vz/tmp is not used as the temporary folder for the DemonData file.
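
    You can watch it happening while the backup runs with something like this (assuming the temporary file name contains "DemonData", as above):

    # watch -n 10 'df -h /tmp /vz; ls -lh /tmp | grep -i demondata'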
  12. figjam

    figjam New Member

    Messages:
    110
    Related issue?

    I've also had an issue for a long time where random containers would immediately time out from the command line with the same "#307 Message processing was terminated because timeout was reached waiting for a response" before it got to the progress bar. I would have up to 3 of these a week.

    # vzabackup -i localhost -e 100

    Starting backup operation for node 'hwnodex.net24.net.nz'...
    * Operation with the Container Customer is started
    * Backing up environment Customer locally
    * Checking parameters
    * Dumping quota
    * Creating backup d8b6a95f-506a-7e41-b5e2-bca5d62e5dd3/20120422211750
    * Adjusting backup type (incremental)
    * Backup storage: receiving backup file
    * Preparing for backup operation
    * Snapshots are not created, continuing backup...
    * Backing up private area
    * #307 Message processing was terminated because timeout was reached waiting for a response
    * Operation with the Container Customer is finished with errors: The request was timed out.
    Backup operation for node 'hwnodex.net24.net.nz' failed: The request was timed out
    Backup failed


    These can always be fixed by re-running the backup as a full backup, but that means I lose history because I only keep 2 full backups. I have attached the relevant log entries. Is anyone else seeing this?
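
    That is, re-running it with the full-backup form MakcyD showed above (if I read his example right, -F requests a full backup while -i is an incremental one; CT 100 in my case):

    # vzabackup -F localhost -e 100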

    Cheers,


    Andrew

    Attached Files:

  13. figjam

    figjam New Member

    Messages:
    110
    I should have tried this long before now, but deleting the last incremental also works. This seems to point the finger at the last incremental backup being the problem, even though it was supposedly completed successfully.

    Cheers,


    Andrew
