We needed to load data into RDS on the tail end of an ETL process. The L unfortunately, was standing for "Long" instead of "Load", and that client-driven window of time that we had to load things in was being eclipsed by the rapidly growing mass of data.
AWS gives us DataPipeline for this. It's billed as:
AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premise data sources, at specified intervals.
If you access your AWS console and find DataPipeline, you'll see a nice splash page on startup that lets you configure your flows; luckily, there's one template specifically tailored to moving things from S3 to RDS. Load S3 data into RDS MySQL table
Creating Your Pipeline
So, select that template, fill out all of the fields.
One specification, under "Schedule", select "Run on pipeline activation".
When you save everything you'll get dropped onto a graphical editor. If you're like me, here's where you probably scratching your head a bit. We're going to ditch this panel for now.
Configuration Part 1 : IAM User
Hop on over to Identity & Access Management (IAM) and create a user that'll use DataPipeline. I'm going to name mine data_pipeline_agent.
After the user is created (and you've logged its Key and Secret somewhere), associate the DataPipeline-created roles to it:
- Click your user in the Users list.
- Click on the Permissions tab
- Click the Attach Policy button
- Separately, attach AWSDataPipelineRole and AmazonEC2RoleforDataPipelineRole
- Attach a third policy to let the user upload to S3 (you have one?)*
*If you haven't created an S3 policy alread, you can create one based on the example below and attach it to your user. YMMV.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Stmt1448918570000",
"Effect": "Allow",
"Action": [
"s3:GetBucketAcl",
"s3:GetBucketCORS",
"s3:GetObject",
"s3:GetObjectAcl",
"s3:GetObjectVersion",
"s3:GetObjectVersionAcl",
"s3:GetObjectVersionTorrent",
"s3:ListBucket",
"s3:PutObject",
"s3:PutObjectAcl",
"s3:DeleteObject",
"s3:DeleteObjectVersion"
],
"Resource": [
"arn:aws:s3:::YOUR-BUCKET-NAME/*"
]
},
{
"Sid": "Stmt1448918570001",
"Effect": "Allow",
"Action": [
"s3:GetBucketAcl",
"s3:GetBucketCORS",
"s3:GetObject",
"s3:GetObjectAcl",
"s3:GetObjectVersion",
"s3:GetObjectVersionAcl",
"s3:GetObjectVersionTorrent",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::YOUR-BUCKET-NAME"
]
}
]
}
Configuration Part 2 : PHP Project
Shell into your project folder, and issue composer init to get your composer file set up. If you haven't installed composer, you can Google composer quick-starts ad nauseam.
Next issue: composer require aws/aws-sdk-php
That'll install all of the AWS goodies that you need.
Next, create a php file to contain the script we'll write inside your project folder. Your script file should look like this:
// composer autoload
include 'vendor/autoload.php';
$expected_filename = '20160407.csv';
try
{
//
// 0. Process your CSV file here, I was dumping mine into $expected_filename
//
//
// 1. Upload the file and create the Sdk instance
//
$sdk = new Aws\Sdk([
'credentials' => [
'key' => 'IAM_KEY', // data_pipeline_agent
'secret' => 'IAM_SECRET',
],
'region' => 'us-east-1',
'version' => 'latest',
]);
$s3Client = $sdk->createS3();
$s3Client->putObject([
'Bucket' => "your-bucket-name",
'Key' => $expected_filename,
'SourceFile' => $expected_filename,
]);
//
// 2. Trigger DataPipeline
//
$client = $sdk->createDataPipeline();
$client->activatePipeline([
'pipelineId' => 'YourPipelineID',
'parameterValues' => [
[
'id' => 'myInputS3Loc',
'stringValue' => 's3://your-bucket-name/' . $expected_filename,
],
[
'id' => '*myRDSPassword',
'stringValue' => '',
],
[
'id' => 'myRDSUsername',
'stringValue' => 'your_rds_user',
],
[
'id' => 'myRDSTableInsertSql',
'stringValue' => 'INSERT INTO yourTable ( `user_id`, `type`, `time_recorded`, `sequence_number`, `station_id` ) VALUES ( ?, ?, ?, ?, ? )',
],
[
'id' => 'myRDSTableName',
'stringValue' => 'your_table_name',
],
[
'id' => 'myRDSConnectStr',
'stringValue' => 'jdbc:mysql://dbinstanceid.region.rds.amazonaws.com:3306/dbname',
],
[
'id' => 'myEc2RdsSecurityGrps',
'stringValue' => 'any_security_groups_needed',
],
],
]);
}
catch( Exception $x )
{
echo "Error: " . $x->getMessage() . "\n";
}
The snippet above is pretty self-explanatory; feel free to ask questions of course! Here are some frustrations explained:
- The config in the web GUI is artificial. Anything you pass in through the SDK will overwrite whatever you punch into the web panel. I wasted a bit of time here thinking it was a special "fallback" type of config. No niceties here, if you don't fully define the config at call-time, then the blank parameters truly become blank and the DataPipeline fails.
- The template has an asterisk in front of the RDS password.
- There's a pretty long lag between the second you use the SDK and the second at which the web GUI reacts. Be patient (though admittedly, this makes debugging a PITA).
Let me know if this helps you! I found the documents and GUI to be pretty disjoint, hopefully this glues them together for you.