How to transfer files from Google Cloud Storage (GCS) into an Amazon S3 bucket without downloading the files

  1. This is also one of the easiest ways to transfer data from BigQuery to Amazon S3 (by first exporting it to GCS)
  2. This tutorial is written for beginners with no scripting experience and can be followed by anyone

Prerequisites

To transfer files from Google Cloud Storage into Amazon S3, you will need the following:

  • Access to create ‘VM Instances’ in Google Cloud (steps explained below). Pricing is roughly $0.50 per hour for an instance with 4 vCPUs
  • Write access to the S3 bucket folder (a quick way to verify this is shown after this list)
  • Path of the S3 folder where the files need to be copied, along with the AWS credentials
    • Path Example: s3://mybucket/filetransferfrombigquery/
    • Access Key Example: AKIAJPBXUHVICKYDEEPIA
    • Secret Access Key Example: z+9VVickyDeepi+hVaAtmbepw9gA1vjJeshX
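
If you want to quickly confirm that the credentials really have write access before spinning up the VM, you can run a small test from any machine with the AWS CLI installed. This is an optional sanity check, not part of the original steps; the bucket path and file name below are just the examples used in this post.

      $ aws configure                 # enter the access key, secret key and region when prompted
      $ echo test > writecheck.txt
      $ aws s3 cp writecheck.txt s3://mybucket/filetransferfrombigquery/
      $ aws s3 rm s3://mybucket/filetransferfrombigquery/writecheck.txt

If the copy succeeds and you can remove the test file, the credentials are good to go.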

Transferring files from Google Cloud Storage to Amazon S3

To summarize how this method works: we will create a virtual machine instance that acts as an intermediate system and use the gsutil command-line tool to copy the files directly between the two clouds (this does not consume any bandwidth on our own system)

Step 1: Create a VM instance and open an SSH session

  1. Go to the VM Instances page in the Google Cloud Console and sign in with your Google Cloud credentials
  2. Click on ‘Create Instance’
  3. For the region and zone, select a region close to your S3 bucket’s region (the S3 region is visible in the AWS console URL)
  4. Based on the size of the data you want to transfer, select the number of vCPUs and the memory. For transferring 40 GB of data, I created an instance with 4 vCPUs and 30 GB of memory
  5. Let the instance get up and running. Then go back to the VM instances list and open an SSH terminal by clicking on SSH next to the instance
  6. This will open a terminal
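
If you prefer the command line over the console, roughly the same instance can be created with the gcloud CLI. This is just an optional sketch; the instance name and zone below are placeholders, and e2-highmem-4 (4 vCPUs, 32 GB memory) is only one possible machine type close to the one I used.

      $ gcloud compute instances create gcs-to-s3-transfer \
            --zone=us-east1-b \
            --machine-type=e2-highmem-4 \
            --image-family=debian-12 \
            --image-project=debian-cloud
      $ gcloud compute ssh gcs-to-s3-transfer --zone=us-east1-b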

Step 2: Create a configuration file on the VM instance for S3 access

Now that the VM instance is ready and we have an SSH session into it, we will create a configuration file that stores the credentials for the Amazon S3 bucket. The instance already has access to the GCS folder by default (as long as it is accessible through your account)

    • The first step is to navigate to the user’s home directory (this will usually already be the current directory, but let’s not leave it to chance)
      [username@instancename ~]$ cd ~
    • Now create an empty .boto file where we will be storing our credentials
      [username@instancename ~]$ touch .boto
    • Now append 3 lines to the .boto file from the command line (the header, the access key and the secret access key); the quotes keep the shell from interpreting the brackets and special characters
      [username@instancename ~]$ echo "[Credentials]" >> ~/.boto
      [username@instancename ~]$ echo "aws_access_key_id = AKIAJPBXUHVICKYDEEPIA" >> ~/.boto
      [username@instancename ~]$ echo "aws_secret_access_key = z+9VVickyDeepi+hVaAtmbepw9gA1vjJeshX" >> ~/.boto
    • Now read the file to confirm that the .boto configuration file has the right credentials
      [username@instancename ~]$ cat .boto
    • This will display the credentials in the terminal, as shown below. Once you confirm that the credentials are right, you are good to transfer the files
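
The output of the cat command should look like the following (with your own keys in place of the example ones). If any line is missing or mistyped, delete the file with rm ~/.boto and repeat the echo commands.

      [Credentials]
      aws_access_key_id = AKIAJPBXUHVICKYDEEPIA
      aws_secret_access_key = z+9VVickyDeepi+hVaAtmbepw9gA1vjJeshX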

Step 3: Transferring files using gsutil

Now that we have the credentials in place, use the gsutil command-line tool to transfer the files from GCS to Amazon S3

    • The following command will recursively copy all the files in the GCS folder into the S3 bucket folder
      [username@instancename ~]$ gsutil cp -r gs://myfolder/myfilefolder/ s3://mybucket/filetransferfrombigquery/
    • If you only want to copy specific files matching a wildcard pattern, put the wildcard inside the GCS path
      [username@instancename ~]$ gsutil cp gs://myfolder/myfilefolder/*.txt s3://mybucket/filetransferfrombigquery/
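
For larger transfers, gsutil’s top-level -m flag performs the copy with parallel processes/threads, which usually speeds things up noticeably. The paths below are the same example paths used above.

      [username@instancename ~]$ gsutil -m cp -r gs://myfolder/myfilefolder/ s3://mybucket/filetransferfrombigquery/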

Once the copy starts, the SSH terminal will show the transfer progress. How long it takes depends on the amount of data and the machine size; my 40 GB transfer finished in about 20 minutes. When it completes, all the files will be in the S3 folder

Do not forget to terminate the instance once you are done (from the console, or with the gcloud command shown below); otherwise you will keep being billed for as long as it is running.
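
If you created the instance with gcloud as sketched earlier, deleting it from the command line looks like this (the instance name and zone are the same placeholders as before):

      $ gcloud compute instances delete gcs-to-s3-transfer --zone=us-east1-b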

Got questions? Feel free to comment below.



Author: Vignesh Kumar Sivanadan
Data scientist at a leading data science company, and a freelance Tableau developer and consultant on Upwork (hire me to work for you on Upwork: Click Here). These blog posts are based on my experiences.
