S3A is an open-source connector for Hadoop. It helps Hadoop users address storage scaling issues by providing a second tier of storage that is optimized for cost and capacity.
NOTE: S3A support is available in Hadoop 2.7 and later versions.
Hadoop S3A allows you to connect your Hadoop cluster to any S3-compatible object store in the public cloud, in a hybrid cloud, or on-premises.
S3A performance optimization
Performance-related S3A settings are listed in the following table.

Setting                             Default    Recommended
----------------------------------  ---------  --------------------------------------------
fs.s3a.multipart.size               100M       150M
fs.s3a.fast.upload.active.blocks    4          8
fs.s3a.fast.upload.buffer           disk       array or bytebuffer
fs.s3a.threads.max                  10         Between 25% and 50% of configured CPU cores
fs.s3a.multiobjectdelete.enable     true       true or false
fs.s3a.committer.threads            8          8

NOTE: When fs.s3a.fast.upload.buffer uses memory buffering, the memory consumed is fs.s3a.multipart.size * fs.s3a.fast.upload.active.blocks.
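As a sketch, the recommended values in the table map to core-site.xml entries such as the following. The fs.s3a.threads.max value of 8 assumes a hypothetical 16-core node (50% of cores); size it for your own hardware.

```xml
<!-- core-site.xml: S3A performance tuning (values from the table above) -->
<property>
  <name>fs.s3a.multipart.size</name>
  <value>150M</value>
</property>
<property>
  <name>fs.s3a.fast.upload.active.blocks</name>
  <value>8</value>
</property>
<property>
  <!-- "bytebuffer" buffers off-heap; "array" buffers on the JVM heap -->
  <name>fs.s3a.fast.upload.buffer</name>
  <value>bytebuffer</value>
</property>
<property>
  <!-- Example sizing: 8 threads = 50% of a hypothetical 16-core node -->
  <name>fs.s3a.threads.max</name>
  <value>8</value>
</property>
<property>
  <name>fs.s3a.committer.threads</name>
  <value>8</value>
</property>
```

Remember that memory buffering consumes fs.s3a.multipart.size * fs.s3a.fast.upload.active.blocks per output stream, so raise these values together with the JVM memory settings.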
Using the magic committer
It is recommended that you use the magic committer to commit data to S3A storage; all test performance numbers are reported with the magic committer enabled.
NOTE: The magic committer does not support all Hadoop tools and services. In those cases, the system automatically falls back to the FileOutputCommitter.
When using the magic committer:
- Data is written directly to S3, but retargeted at the final destination.
- Conflicts are managed across the directory tree.
Configure the following S3A Hadoop parameters to use the magic committer:
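The parameter list itself did not survive in this copy of the section. The standard Apache Hadoop settings for enabling the magic committer are shown below as a reference sketch; check your Hadoop release's S3A committer documentation for the exact set.

```xml
<!-- core-site.xml: enable the S3A magic committer -->
<property>
  <name>fs.s3a.committer.name</name>
  <value>magic</value>
</property>
<property>
  <name>fs.s3a.committer.magic.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Binds MapReduce/Spark output commits on s3a:// paths to the S3A committers -->
  <name>mapreduce.outputcommitter.factory.scheme.s3a</name>
  <value>org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory</value>
</property>
```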
Hadoop configuration analysis using ECS Service Console
The ECS Service Console (SC) can read and interpret the Hadoop configuration parameters that govern connections to ECS for S3A and ViPRFS. The SC also provides a function, Get_Hadoop_Config, that reads the Hadoop cluster configuration and checks the S3A and ViPRFS settings for typos, errors, and values that are not recommended. Contact for assistance with installing ECS SC.
# service-console run Get_Hadoop_Config
Getting temporary credentials
You require temporary credentials to securely access storage through Hadoop S3A.
Prior to ECS IAM, Hadoop access to ECS object storage through S3A required an ECS S3 object user name and a secret key, and ACL-level security was not possible with S3A. With the ECS IAM and Secure Token Service (STS) features, an administrator has several more secure options for controlling access to S3A storage. One option is to create IAM policies that define permissions appropriate for the customer business case. Once the policies are in place, IAM groups can be created and attached to the policies. Individual IAM users can then be created and added to the IAM groups. IAM users are assigned S3 access keys and secret keys that can be used to access the S3A data, subject to the IAM policy for that user.
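As an illustration of the first option, a minimal IAM policy granting list and read/write access to a single bucket might look like the following. The bucket name hadoop-data is a hypothetical placeholder.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::hadoop-data"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::hadoop-data/*"
    }
  ]
}
```

Note that bucket-level actions such as s3:ListBucket apply to the bucket ARN, while object-level actions apply to the object ARN pattern ending in /*.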
Another option for administrators is to use STS and SAML assertions to allow federated users to obtain temporary credentials. In this use case, a cross-trust relationship must be established between ECS and the Identity Provider. As in the previous example, IAM policies must first be created. Once the policies are defined, the administrator creates IAM roles that are attached to the IAM policies. Federated users can then authenticate and obtain a SAML assertion from the Identity Provider. The assertion is used to assume one of the IAM roles permissible for the user. Once the role has been assumed, the Hadoop user is provided with a temporary access key, a temporary secret key, and a temporary token. The Hadoop user uses these temporary credentials to access the S3A data until the credentials expire. These temporary credentials correspond to the configured policies, which enforce security controls on the S3 object store.
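Once temporary credentials have been obtained, they can be passed to S3A through the standard Hadoop temporary-credentials properties, as sketched below. The three values are placeholders for the access key, secret key, and session token returned by STS.

```xml
<!-- core-site.xml: use STS temporary credentials with S3A -->
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>TEMPORARY_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>TEMPORARY_SECRET_KEY</value>
</property>
<property>
  <name>fs.s3a.session.token</name>
  <value>TEMPORARY_SESSION_TOKEN</value>
</property>
```

When the credentials expire, new values must be obtained from STS and the configuration (or per-bucket overrides) updated accordingly.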