
ECS 3.6.2 Data Access Guide


Hadoop S3A for ECS

S3A is an open-source connector for Hadoop. It helps Hadoop users address storage scaling issues by providing a second tier of storage that is optimized for cost and capacity.

NOTE: S3A support is available on Hadoop 2.7 and later versions.

Hadoop S3A allows you to connect your Hadoop cluster to any S3-compatible object store that is in the public cloud, hybrid cloud, or on-premises.
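For example (a minimal sketch that is not part of this guide; the endpoint URL, bucket name, and credentials are placeholders), a Hadoop client can list an ECS bucket over S3A by supplying the ECS S3 endpoint and object-user credentials on the command line:
$ hdfs dfs -D fs.s3a.endpoint=https://ecs.example.com:9021 \
    -D fs.s3a.path.style.access=true \
    -D fs.s3a.access.key=ACCESS-KEY \
    -D fs.s3a.secret.key=SECRET-KEY \
    -ls s3a://s3aTestBucket/
In practice, these settings are usually placed in core-site.xml so that they do not have to be repeated on every command.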

S3A performance optimization

Performance-related S3A settings, with their default and recommended values, are listed below.

fs.s3a.multipart.size
  • Default: 100M
  • Recommended: 150M
fs.s3a.fast.upload.active.blocks
  • Default: 4
  • Recommended: 8
fs.s3a.fast.upload.buffer
  • Default: disk
  • Recommended: array or bytebuffer
  NOTE: The heap space that is used is fs.s3a.multipart.size * fs.s3a.fast.upload.active.blocks.
fs.s3a.threads.max
  • Default: 10
  • Recommended: Between 25% and 50% of the configured CPU cores.
fs.s3a.multiobjectdelete.enable
  • Default: true
  • Recommended: true or false
fs.s3a.committer.threads
  • Default: 8
  • Recommended: 8
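As an illustrative sketch (not from this guide; the distcp job, paths, and bucket are placeholders, and the thread count of 16 assumes a host where that value falls within the 25%-50% guidance), the recommended values can be supplied per job with -D options, or set once in core-site.xml:
$ hadoop distcp \
    -D fs.s3a.multipart.size=150M \
    -D fs.s3a.fast.upload.active.blocks=8 \
    -D fs.s3a.fast.upload.buffer=bytebuffer \
    -D fs.s3a.threads.max=16 \
    -D fs.s3a.multiobjectdelete.enable=true \
    -D fs.s3a.committer.threads=8 \
    hdfs:///data/source s3a://s3aTestBucket/target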

Using magic committer

It is recommended to use the magic committer to commit data to disk in various styles and to report all test performance numbers.
NOTE: The magic committer does not support all Hadoop tools and services. In such cases, the system automatically uses the FileOutputCommitter instead.

When using the magic committer:

  • Data is written directly to S3, but retargeted at the final destination.
  • Conflicts are managed across the directory tree.

Configure the following S3A Hadoop parameters to use the magic committer:

  • fs.s3a.committer.magic.enabled: true
  • fs.s3a.committer.name: magic
  • mapreduce.outputcommitter.factory.scheme.s3a: org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
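As an illustrative sketch (not part of this guide; the example jar, job, and paths are placeholders), these parameters can also be passed to an individual MapReduce job with -D options instead of being set cluster-wide:
$ hadoop jar hadoop-mapreduce-examples.jar wordcount \
    -D fs.s3a.committer.magic.enabled=true \
    -D fs.s3a.committer.name=magic \
    -D mapreduce.outputcommitter.factory.scheme.s3a=org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory \
    hdfs:///data/input s3a://s3aTestBucket/output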

Hadoop configuration analysis using ECS Service Console

The ECS Service Console (SC) can read and interpret your Hadoop configuration parameters with respect to connections to ECS for S3A and ViPRFS. The SC also provides a function, Get_Hadoop_Config, that reads the Hadoop cluster configuration and checks the S3A and ViPRFS settings for typos, errors, and values that are not recommended. Contact Dell EMC support for assistance with installing the ECS SC.
 # service-console run Get_Hadoop_Config

Getting temporary credentials

You require temporary credentials to securely access storage through Hadoop S3A.

Prior to ECS IAM, Hadoop access to ECS object storage using S3A required an ECS S3 object user name and a secret key, and ACL-level security was not possible with S3A. With the ECS IAM and Secure Token Service (STS) features, an administrator has several more secure options for controlling access to the S3A storage. One option is to create IAM policies that define permissions appropriate for the customer business case. Once the policies are in place, IAM groups can be created and attached to the policies. Individual IAM users can then be created and made members of the IAM groups. IAM users are assigned S3 access keys and secret keys that can be used to access the S3A data, subject to the IAM policies that apply to the user.
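A rough sketch of this first option is shown below. It assumes that the AWS CLI is pointed at the ECS IAM endpoint; the endpoint variable, names, policy document, and URNs are hypothetical examples, not values from this guide:
# Create a policy from a JSON policy document, then a group that uses it
$ aws iam create-policy --policy-name s3aAccess --policy-document file://s3a-policy.json --endpoint-url $ECS_IAM_ENDPOINT
$ aws iam create-group --group-name s3aUsers --endpoint-url $ECS_IAM_ENDPOINT
$ aws iam attach-group-policy --group-name s3aUsers --policy-arn urn:ecs:iam::ns1:policy/s3aAccess --endpoint-url $ECS_IAM_ENDPOINT
# Create a user, add it to the group, and generate the S3 access key and secret key
$ aws iam create-user --user-name hadoopUser --endpoint-url $ECS_IAM_ENDPOINT
$ aws iam add-user-to-group --user-name hadoopUser --group-name s3aUsers --endpoint-url $ECS_IAM_ENDPOINT
$ aws iam create-access-key --user-name hadoopUser --endpoint-url $ECS_IAM_ENDPOINT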

Another option for administrators is to use STS and SAML assertions to allow federated users to obtain temporary credentials. In this use case, a cross-trust relationship must be established between ECS and the Identity Provider. As in the previous example, IAM policies must first be created. Once the policies are defined, the administrator creates IAM roles that are attached to the IAM policies. Federated users can then authenticate and obtain a SAML assertion from the Identity Provider. The assertion is used to assume one of the IAM roles that is permissible for the user. Once the role has been assumed, the Hadoop user is provided with a temporary access key, a temporary secret key, and a temporary token. The Hadoop user uses these temporary credentials to access the S3A data until the credentials expire. These temporary credentials correspond to the configured policies, which enforce security controls on the S3 object store.
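A rough sketch of this second option follows; the role and provider URNs, the assertion file, and the STS endpoint are hypothetical placeholders, and it assumes the ECS STS endpoint accepts the AWS-compatible assume-role-with-saml call:
$ aws sts assume-role-with-saml \
    --role-arn urn:ecs:iam::ns1:role/s3aAnalyst \
    --principal-arn urn:ecs:iam::ns1:saml-provider/adfs \
    --saml-assertion file://assertion.b64 \
    --duration-seconds 43200 \
    --endpoint-url $ECS_STS_ENDPOINT
The AccessKeyId, SecretAccessKey, and SessionToken values in the response are the temporary credentials that map to the fs.s3a.* settings listed below.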

For more information about STS, see Secure Token Service.

The temporary credentials are passed to Hadoop using these Hadoop settings:
  • fs.s3a.access.key=ACCESS-KEY
  • fs.s3a.secret.key=SECRET-KEY
  • fs.s3a.session.token=SESSION-TOKEN
  • fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
A sample command using the temporary credentials is provided below:
$ hdfs dfs -D fs.s3a.secret.key=SECRET-KEY \
    -D fs.s3a.access.key=ACCESS-KEY \
    -D fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider \
    -D fs.s3a.session.token=SESSION-TOKEN \
    -ls s3a://s3aTestBucket/test/SparkWordCount/
NOTE: The temporary credentials can last up to 12 hours.
