Difference between revisions of "HPC:Archive System"
Latest revision as of 16:52, 13 December 2022
This page describes the PMACS Archive System, which is attached to the PMACS HPC Cluster.
Using the Archive
The Penn Medicine Academic Computing Services group provides a tape archive service at the High Performance Computing (HPC) Center. This service is designed to provide an inexpensive alternative to storage provided on hard drives for data that needs to be kept for a long time but is likely to be retrieved only rarely.
BE AWARE: Tapes are kept secured within the tape library and given the full protection of a Tier III data center, but if a disaster were to destroy the facility or the tape library itself, archived data would be lost and would not be recoverable.
To access the archive you must have a valid PMACS userid and password and provide some billing information ahead of time. If you are unsure whether your userid is valid on the PMACS network, you can check by logging in at https://reset.pmacs.upenn.edu. If you haven't already done so, be sure to register your userid for self-service password resets. Once your userid is authorized for use with the HPC resources, you will receive first-time logon instructions.
When you are ready to move data into the archive, request that a directory be created for you.
Adding Data to the Archive
Access to the archive is via our server "mercury" (see below for access instructions). Once there, you can use the rsync command with the specific options shown below, to copy files and directory structures into it. Because this is a user-accessible archive system, what you will see in that directory structure is not the actual files (which will have been moved off to a staging area and then written to 2 separate tapes) but a representation of them. In this way, you can always see what's in the archive (including file sizes, and date last modified) and delete anything you wish, at any time. (NOTE: The deletion process in the archive immediately makes those files inaccessible, and we have no other "backup" system in place.)
Manual method
Here are the steps to place your files and folders into the archive:
The maximum file size (per file) for the archive is 2TB
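The per-file limit can be checked before any transfer starts. The following is a minimal sketch, not part of the PMACS tooling: the function name `find_oversized` is hypothetical, and the default limit assumes that "2TB" means 2 × 10^12 bytes, which the page does not specify.

```shell
#!/bin/bash
# Sketch: list files larger than a size limit before archiving.
# ASSUMPTION: the default limit of 2000000000000 bytes is decimal 2 TB;
# adjust it if the archive's limit is defined differently.

find_oversized() {
    local dir="$1"                           # directory tree to scan
    local limit_bytes="${2:-2000000000000}"  # size limit in bytes
    # -size +Nc matches regular files strictly larger than N bytes
    find "$dir" -type f -size +"${limit_bytes}"c
}

# Example usage (the path is illustrative):
# find_oversized /project/mylab/me
```

If the function prints nothing, no file in the tree exceeds the limit. Any file it does list would need to be split or handled separately before archiving.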
Step 1: Log in to the PMACS HPC file transfer server
ssh to our server mercury.pmacs.upenn.edu <-- this step is often overlooked
Step 2: rsync files
Use this specific rsync command to copy files into the archive:
$ rsync -rplot --inplace --no-partial --whole-file --no-checksum --stats {source} {destination}/
For example:
$ rsync -rplot --inplace --no-partial --whole-file --no-checksum --stats /project/mylab/me /archivetape/mylab/
Note: Pay attention to the first two lines of output from the "--stats" option, which tell you the number of files in your source directory and the number of files copied. If those two numbers are not the same, please be sure you know why.
==> TIP: A trailing "/" makes a difference! A slash at the end of the source path instructs rsync not to create that last sub-directory and to copy only its contents (including all sub-directories), whereas omitting the slash copies that last sub-directory itself, along with its contents.
For example:
$ rsync (options omitted) /home/rgodshal/pub/ /archivetape/rrg     <-- trailing "/" on {source}
[rgodshal@mercury ~]$ ls -l /archivetape/rrg
drwxrwxr-x 2 rgodshal rgodshal    4096 Oct 10  2017 consign-opt     <-- these files are the contents of /pub, in the rrg folder
drwxr-xr-x 3 rgodshal rgodshal    4096 Aug 25 11:30 mathworks_downloads
-rw-r--r-- 1 rgodshal rgodshal 1017044 Jan  9  2017 RFS-v5 2 1-4145-release-notes.pdf
compared to:
$ rsync (options omitted) /home/rgodshal/pub /archivetape/rrg     <-- no trailing "/" on {source}
[rgodshal@mercury ~]$ ls /archivetape/rrg
drwxrwx--- 4 rgodshal rgodshal 32768 Mar  4 10:38 pub     <-- there's /pub (with all its contents)
Adding Data to the Archive using BSUB script
The script below can be adapted to copy data stored under /home or /project directories on the HPC disk to the archive.
#!/bin/bash
# Job script to copy data from a $HOME or /project dir to the Archive system
DEST="/archivetape/<DESTINATION_DIR>"
SRC="<SOURCE_DIR>"
SERVER="mercury-eth5"
E_NODIR=89
#BSUB -J archive_data_put            # LSF job name
#BSUB -o archive_data_put.%J.out     # Name of the job output file
#BSUB -e archive_data_put.%J.error   # Name of the job error file
#BSUB -N
#BSUB -u <email>@upenn.edu

# Create a passphrase-less SSH key on first use so the job can reach $SERVER non-interactively
if [ ! -f "$HOME/.ssh/id_rsa" ]; then
    ssh-keygen -f "$HOME/.ssh/id_rsa" -q -N ""
    cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"
    chmod 600 "$HOME/.ssh/authorized_keys"
fi

if [ ! -d "$SRC" ]; then
    echo "$SRC does not exist" >&2
    exit $E_NODIR
fi

# Double quotes make $DEST expand locally before the test runs on $SERVER;
# the exit status of ssh tells this script whether the directory exists
if ! ssh -o "StrictHostKeyChecking no" "$SERVER" "[ -d '$DEST' ]"; then
    echo "$DEST does not exist, create it first!" >&2
    exit $E_NODIR
fi

rsync -rplot --inplace --no-partial --whole-file --no-checksum --progress --stats "$SRC/" "$SERVER:$DEST/"
NOTE 1: You will need to set SRC, DEST and email above.
To run the above script from the HPC head node (consign.pmacs.upenn.edu) or from an interactive session, execute the following command (assuming the script is called "archive_data_put.sh"):
bsub < archive_data_put.sh
NOTE 2: The "<" above is not a typo; it is required to ensure that the script is run as an LSF "job" script. When the job finishes, it will send an email notification.
Retrieving Data From the Archive
When you wish to retrieve data from the archive, you can copy single files, sets of files, or directories back to your /home or /project directory on mercury.
Before retrieving files, run the pre-fetch command, which greatly increases efficiency. It copies your data from tape to the archive system's disk cache, which allows rsync or cp to perform optimally. It is particularly effective when retrieving many small files. If you used tar to consolidate and compress your directories before archiving them, and are therefore retrieving a single large tarball, the pre-fetch is not needed.
To use the pre-fetch, on mercury, change to the utilities directory and ask the system to pre-fetch the file or directory you wish to work with. For example:
$ cd /archivetape/snutils/
$ ./snretrieve -a /archivetape/mylab/my-dir
Please be careful to fetch only the data you need, to avoid unnecessarily filling the disk cache.
After you enter the snretrieve command, it gives you a job number and returns you to the command prompt. You may then immediately begin your rsync or cp command.
Retrieving data is as simple as using rsync with the source and destination directories reversed from the command you used to place data into the archive.
For example:
rsync -rplot --inplace --no-partial --whole-file --no-checksum --stats /archivetape/mylab/me /project/mylab/
You can also use cp or scp to transfer data in and out of the archive. We prefer rsync for its ability to compare source and destination and make efficient updates.
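The earlier suggestion to consolidate directories with tar before archiving can be sketched as follows. This is only an illustration under stated assumptions: the function names `make_tarball` and `list_tarball` are hypothetical, and the paths in the usage note are examples rather than real locations.

```shell
#!/bin/bash
# Sketch: bundle a directory into one compressed tarball before archiving,
# so that retrieval later involves a single large file.

make_tarball() {
    local src_dir="$1"    # directory to bundle
    local out_file="$2"   # tarball to create
    # -C switches to the parent directory so the tarball stores relative paths
    tar -czf "$out_file" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}

# List the contents without extracting, to confirm what was bundled
list_tarball() {
    tar -tzf "$1"
}

# Example usage (paths are illustrative):
# make_tarball /project/mylab/me/run1 /project/mylab/me/run1.tar.gz
# list_tarball /project/mylab/me/run1.tar.gz
```

The resulting tarball still has to stay under the archive's per-file size limit, so very large directories may need to be split into several tarballs.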
Getting Data back from the archive using BSUB script
The script below can be adapted to copy data stored in the archive back to /home or /project directories on the HPC disk.
#!/bin/bash
# Job script to copy data from an Archive directory to $HOME or /project dir
SRC="/archivetape/<SRC>"
DEST="<DESTINATION>"
SERVER="mercury"
E_NODIR=89
#BSUB -J archive_data_get            # LSF job name
#BSUB -o archive_data_get.%J.out     # Name of the job output file
#BSUB -e archive_data_get.%J.error   # Name of the job error file
#BSUB -N
#BSUB -u <email>@upenn.edu

# Create a passphrase-less SSH key on first use so the job can reach $SERVER non-interactively
if [ ! -f "$HOME/.ssh/id_rsa" ]; then
    ssh-keygen -f "$HOME/.ssh/id_rsa" -q -N ""
    cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"
    chmod 600 "$HOME/.ssh/authorized_keys"
fi

if [ ! -d "$DEST" ]; then
    echo "$DEST does not exist, create it first!" >&2
    exit $E_NODIR
fi

# Double quotes make $SRC expand locally before the test runs on $SERVER;
# the exit status of ssh tells this script whether the directory exists
if ! ssh "$SERVER" "[ -d '$SRC' ]"; then
    echo "$SRC does not exist, no data to retrieve!" >&2
    exit $E_NODIR
fi

# Pre-fetch from tape to the disk cache, then copy the data back
ssh "$SERVER" "cd /archivetape/snutils/ && ./snretrieve -a '${SRC}'"
rsync -rplot --inplace --no-partial --whole-file --no-checksum --progress --stats "$SERVER:$SRC/" "$DEST/"
NOTE 1: You will need to set SRC, DEST and email above.
To run the above script from the HPC head node (consign.pmacs.upenn.edu) or from an interactive session, execute the following command (assuming the script is called "archive_data_get.sh"):
bsub < archive_data_get.sh
NOTE 2: The "<" above is not a typo; it is required to ensure that the script is run as an LSF "job" script. When the job finishes, it will send an email notification.
Deleting Data from Archive
PLEASE be sure that you have retrieved files you want to keep before deleting them from the archive. This is your only "backup" copy in the HPC environment! Use the "rm" command as you would for ordinary files and directories:
rm -rf /archivetape/mylab/me/completed
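Before running rm, it may help to confirm that the retrieved copy matches what the archive lists. The sketch below compares only relative paths and file sizes, which the archive's directory view is described as exposing, so it should not need to read file contents back from tape. The function names `listing` and `same_listing` are hypothetical, the `-printf` option assumes GNU find, and matching sizes do not prove that the contents are identical.

```shell
#!/bin/bash
# Sketch: compare relative paths and sizes of regular files in two trees.

listing() {
    # %P = path relative to the starting directory, %s = size in bytes (GNU find)
    (cd "$1" && find . -type f -printf '%P\t%s\n' | sort)
}

same_listing() {
    # Succeeds only when both trees list the same files with the same sizes
    [ "$(listing "$1")" = "$(listing "$2")" ]
}

# Example usage (paths are illustrative):
# same_listing /archivetape/mylab/me/completed /project/mylab/me/completed \
#     && echo "listings match" || echo "listings differ"
```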
Checking usage
If you want to check how much space is being used by data in your archive directory, use
$ du -hbs /archivetape/mylab
NOTE: The archive is a tiered storage system: a disk cache layer holds data temporarily while it is being placed into or retrieved from the archive, and an underlying tape-based system provides long-term storage. Because of this, running "du" without the options noted above is expected to report zero bytes. Using the correct "du" options, or alternatively the "ls -lh" command, reports the correct size, as in the example below:
[asrini@mercury ~]$ ls -lh /archivetape/mylab/asrini/my_very_old_file.tar.gz
-rw-rw-r-- 1 asrini era_team 245G Jun 15  2015 /archivetape/mylab/asrini/my_very_old_file.tar.gz
[asrini@mercury ~]$ du /archivetape/mylab/asrini/my_very_old_file.tar.gz
0       /archivetape/mylab/asrini/my_very_old_file.tar.gz
[asrini@mercury ~]$ du -sh /archivetape/mylab/asrini/my_very_old_file.tar.gz
0       /archivetape/mylab/asrini/my_very_old_file.tar.gz
[asrini@mercury ~]$ du -hbs /archivetape/mylab/asrini/my_very_old_file.tar.gz
263066746880    /archivetape/mylab/asrini/my_very_old_file.tar.gz
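The difference between the two "du" results above can be reproduced on an ordinary Linux file system. In GNU du, "-b" is shorthand for "--apparent-size --block-size=1", so it reports the file's length rather than the disk blocks allocated to it. The sketch below uses a sparse file as a stand-in; that the archive reports zero for the same underlying reason is an assumption, not something the page states.

```shell
#!/bin/bash
# Sketch: apparent size versus allocated blocks, using a sparse file as a stand-in.
# Assumes GNU coreutils (truncate, du -b).

demo_file="$(mktemp)"
truncate -s 1M "$demo_file"    # file is 1 MiB long, but no data blocks are written

apparent_bytes="$(du -b "$demo_file" | cut -f1)"   # length of the file in bytes
allocated_blocks="$(du "$demo_file" | cut -f1)"    # blocks actually allocated on disk

echo "apparent size: $apparent_bytes bytes"
echo "allocated:     $allocated_blocks blocks"
rm -f "$demo_file"
```

On file systems that support sparse files, the allocated figure is typically far smaller than the apparent size, which mirrors the gap between plain "du" and "du -hbs" on the archive.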