Difference between revisions of "HPC:Archive System"

From HPC wiki

Revision as of 14:57, 13 August 2019

This page has details about the PMACS Archive System that is attached to the PMACS HPC Cluster.

Using the Archive

The Penn Medicine Academic Computing Services group provides a tape archive service at the High Performance Computing (HPC) Center. This service is designed to provide an inexpensive alternative to storage provided on hard drives for data that needs to be kept for a long time but is likely to be retrieved only rarely.

BE AWARE: Tapes are kept secured within the tape library and given the full protection of a Tier III data center, but if a disaster were to destroy the facility or the tape library itself, archived data would be lost and would not be recoverable.

To access the archive you must have a valid PMACS userid and password and provide some billing information ahead of time. If you are unsure whether your userid is valid on the PMACS network, you can log in at https://reset.pmacs.upenn.edu. If you haven’t already done so, be sure to register your userid for self-service password resets. Once your userid is authorized for use with the HPC resources, you will receive first-time logon instructions.

When you are ready to move data into the archive, request that a directory be created for you.

Adding Data to the Archive:

Access to the archive is via our server "mercury" (see below for access instructions). Once there, you can use the rsync command with the specific options shown below to copy files and directory structures into the archive. Because this is a user-accessible archive system, what you will see in that directory structure is not the actual files (which will have been moved off to a staging area and then written to two separate tapes) but a representation of them. In this way, you can always see what's in the archive (including file sizes and dates last modified) and delete anything you wish, at any time. (NOTE: The deletion process in the archive immediately makes those files inaccessible, and we have no other "backup" system in place.)

Here are the steps to place your files and folders into the archive:

Step 1: Log in to the PMACS HPC file transfer server

ssh to our server mercury.pmacs.upenn.edu <-- this step is often overlooked

Step 2: rsync files

Use this specific rsync command to copy files into the archive:

$ rsync -rgplot --inplace --no-partial --whole-file --no-checksum --stats {source} {destination}/

For example:
$ rsync -rgplot --inplace --no-partial --whole-file --no-checksum --stats /project/mylab/me /archivetape/mylab/

Note: Pay attention to the first two lines of output from the "--stats" option, which tell you the number of files in your source directory and the number of files copied. If those two numbers are not the same, please be sure you know why.
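As an independent cross-check after the copy, the file counts on both sides can be compared directly. The sketch below is only an illustration: count_files and compare_counts are hypothetical helper names, not part of rsync or the archive system, and they count regular files only, so the numbers may differ from rsync's own totals if those include directories.

```shell
# Count regular files under a directory.
# Hypothetical helper; file names containing newlines would be miscounted.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Compare a source directory with its copy in the archive.
# Prints both counts and returns non-zero if they differ.
compare_counts() {
    src_count=$(count_files "$1")
    dst_count=$(count_files "$2")
    echo "source: $src_count files, archive: $dst_count files"
    [ "$src_count" -eq "$dst_count" ]
}
```

For the example above, this would be called as: compare_counts /project/mylab/me /archivetape/mylab/me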

==> TIP: A trailing "/" makes a difference! A slash at the end of the source path tells rsync not to create that last sub-directory and to copy just its contents (including all sub-directories), whereas omitting the slash copies that last sub-directory itself along with its contents.

 For example:
 $ rsync (options omitted) /home/rgodshal/pub/ /archivetape/rrg  <-- trailing "/" on {source}
 [rgodshal@mercury ~]$ ls -l /archivetape/rrg
 drwxrwxr-x 2 rgodshal rgodshal      4096 Oct 10  2017 consign-opt  <-- these files are the contents of /pub, in the rrg folder
 drwxr-xr-x 3 rgodshal rgodshal      4096 Aug 25 11:30 mathworks_downloads
 -rw-r--r-- 1 rgodshal rgodshal   1017044 Jan  9  2017 RFS-v5 2 1-4145-release-notes.pdf
 compared to:
 $ rsync (options omitted) /home/rgodshal/pub /archivetape/rrg  <-- no trailing "/" on {source}
 [rgodshal@mercury ~]$ ls -l /archivetape/rrg
 drwxrwx--- 4 rgodshal rgodshal 32768 Mar  4 10:38 pub  <-- there's /pub (with all its contents)
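
The trailing-slash rule can be written down as a small string-only helper that predicts where the contents of {source} will end up. This is a sketch for illustration: rsync_target is a hypothetical name, and the function never runs rsync or touches any files.

```shell
# Print the directory in which the contents of {source} will appear.
# Hypothetical helper; it only manipulates the two path strings.
rsync_target() {
    src=$1
    dest=${2%/}    # drop one trailing slash from the destination, if present
    case "$src" in
        */) echo "$dest" ;;                      # trailing slash: contents go directly into dest
        *)  echo "$dest/$(basename "$src")" ;;   # no slash: last directory is recreated under dest
    esac
}
```

With the paths from the example, rsync_target /home/rgodshal/pub/ /archivetape/rrg prints /archivetape/rrg, while rsync_target /home/rgodshal/pub /archivetape/rrg prints /archivetape/rrg/pub.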

Retrieving Data From the Archive:

When you wish to retrieve data from the archive, you can copy single files, sets of files, or directories back to your /home or /project directory on mercury. Before retrieving files, run the pre-fetch command, which greatly increases efficiency: it copies your data from tape to the archive system's disk cache, which then allows rsync or cp to perform optimally. It is particularly effective when retrieving many small files. If you used tar to consolidate and compress your directories before archiving them, and are therefore retrieving just a single large tarball, there is no need for the pre-fetch.

To use the pre-fetch, first change to the snutils directory on mercury:

$ cd /archivetape/snutils/

Now ask the system to pre-fetch the file or directory you wish to work with:

For example:

$ cd /archivetape/snutils/
$ ./snretrieve -a /archivetape/mylab/my-dir


!! Please be careful to fetch only the data you need, to avoid unnecessarily filling the disk cache.

After you enter the snretrieve command, it gives you a job number and returns you to the command prompt. You may then immediately begin your rsync or cp command.

Retrieving data is as simple as using rsync with the source and destination directories reversed from the command you used to place data into the archive.

For example:

 $ rsync -rgplot --inplace --no-partial --whole-file --no-checksum --stats /archivetape/mylab/me /project/mylab/

You can also use cp or scp to transfer data in and out of the archive. We prefer rsync for its ability to compare source and destination and make efficient updates.
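
The pre-fetch and the copy described above can be wrapped in one function so the two steps always run together. This is a minimal sketch, not an official tool: retrieve_from_archive, SNUTILS_DIR, RSYNC_OPTS, and the DRY_RUN switch are hypothetical names introduced here, while the snretrieve and rsync invocations are the ones shown on this page.

```shell
SNUTILS_DIR=/archivetape/snutils
RSYNC_OPTS="-rgplot --inplace --no-partial --whole-file --no-checksum --stats"

# Pre-fetch a path from tape, then copy it back with the documented rsync options.
# Set DRY_RUN=1 to print the commands instead of running them.
retrieve_from_archive() {
    archive_path=$1   # e.g. /archivetape/mylab/me
    dest_dir=$2       # e.g. /project/mylab/
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "cd $SNUTILS_DIR && ./snretrieve -a $archive_path"
        echo "rsync $RSYNC_OPTS $archive_path $dest_dir"
        return 0
    fi
    # snretrieve is run from its own directory, as in the example above
    ( cd "$SNUTILS_DIR" && ./snretrieve -a "$archive_path" ) || return 1
    # $RSYNC_OPTS is intentionally unquoted so it splits into separate options
    rsync $RSYNC_OPTS "$archive_path" "$dest_dir"
}
```

Running it once with DRY_RUN=1 shows exactly which commands would be issued, which is a cheap way to catch a mistyped path before anything is pulled into the disk cache.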

Deleting Data from Archive:

PLEASE be sure that you have retrieved files you want to keep before deleting them from the archive. This is your only "backup" copy in the HPC environment! Use the "rm" command as you would for ordinary files and directories:

 $ rm -rf /archivetape/mylab/me/completed
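
Because deletion is immediate and permanent, it can help to confirm that a retrieved copy is complete before running rm. The sketch below is one possible check, with hypothetical helper names (list_files, retrieved_all); it compares relative file names only, not file contents or sizes.

```shell
# List regular files under a directory as sorted relative paths.
list_files() {
    ( cd "$1" && find . -type f | sort )
}

# Succeeds only if every file in the archive directory also exists,
# by relative path, in the retrieved copy. Contents are not compared.
retrieved_all() {
    archive_dir=$1
    local_copy=$2
    a=$(mktemp) && b=$(mktemp) || return 2
    list_files "$archive_dir" > "$a"
    list_files "$local_copy" > "$b"
    # comm -23 prints lines found only in the first list: files missing locally
    missing=$(comm -23 "$a" "$b")
    rm -f "$a" "$b"
    if [ -n "$missing" ]; then
        echo "missing from $local_copy:"
        echo "$missing"
        return 1
    fi
}
```

A guarded delete would then look like this, assuming the data was retrieved to /project/mylab/me/completed:

 $ retrieved_all /archivetape/mylab/me/completed /project/mylab/me/completed && rm -rf /archivetape/mylab/me/completed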

Checking usage:

If you want to check how much space is being used by data in your archive directory, use "du -bs /archivetape/mylab". That returns the total amount of space used, in bytes. Convert to GB by dividing that number by 1073741824 (1024*1024*1024).
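
The division can be done on the command line instead of by hand. The following is a small sketch using awk; bytes_to_gb and usage_gb are hypothetical helper names, and usage_gb assumes a du that supports the -b option, as the command above does.

```shell
# Convert a byte count to GB (1 GB = 1024*1024*1024 bytes here), two decimals.
bytes_to_gb() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1073741824 }'
}

# Report the usage of a directory in GB.
# du -bs prints "<bytes><TAB><path>", so cut -f1 keeps just the byte count.
usage_gb() {
    bytes=$(du -bs "$1" | cut -f1)
    bytes_to_gb "$bytes"
}
```

For example, usage_gb /archivetape/mylab prints the total for that directory in GB.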
