Friday, November 24, 2017

Using AWS CloudWatch Logs and AWS ElasticSearch for log aggregation and visualization

If you run your infrastructure in AWS, then you can use CloudWatch Logs and AWS ElasticSearch + Kibana for log aggregation/searching/visualization as an alternative to either rolling your own ELK stack, or using a 3rd party SaaS solution such as Logentries, Loggly, Papertrail or the more expensive Splunk, Sumo Logic etc.

Here are some pointers on how to achieve this.

1) Create IAM policy and role allowing read/write access to CloudWatch logs

I created an IAM policy called cloudwatch-logs-access with the following content:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "logs:DescribeLogStreams"
            ],
            "Resource": [
                "arn:aws:logs:*:*:*"
            ]
        }
    ]
}


Then I created an IAM role called cloudwatch-logs-role and attached the cloudwatch-logs-access policy to it.
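One detail the console handles for you when you create a role for EC2: in addition to the permissions policy above, the role needs a trust policy allowing the EC2 service to assume it. Roughly, that document looks like this (a sketch; the console generates the equivalent when you choose EC2 as the trusted entity):

```python
import json

# Sketch of the trust policy an EC2 instance role needs; this is generated
# automatically when you pick "EC2" as the trusted entity in the console.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}
print(json.dumps(trust_policy, indent=2))
```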

2) Attach IAM role to EC2 instances

I attached the cloudwatch-logs-role IAM role to all EC2 instances from which I wanted to send logs to CloudWatch (I went to Actions --> Instance Settings --> Attach/Replace IAM Role and attached the role).

3) Install and configure CloudWatch Logs Agent on EC2 instances

I followed the instructions here for my OS, which is Ubuntu.

I first downloaded a Python script:

# curl https://s3.amazonaws.com/aws-cloudwatch/downloads/latest/awslogs-agent-setup.py -O

Then I ran the script in the region where my EC2 instances are:


# python awslogs-agent-setup.py --region us-west-2
Launching interactive setup of CloudWatch Logs agent ...
Step 1 of 5: Installing pip ...DONE
Step 2 of 5: Downloading the latest CloudWatch Logs agent bits ... DONE
Step 3 of 5: Configuring AWS CLI ...
AWS Access Key ID [None]:
AWS Secret Access Key [None]:
Default region name [us-west-2]:
Default output format [None]:
Step 4 of 5: Configuring the CloudWatch Logs Agent ...
Path of log file to upload [/var/log/syslog]:
Destination Log Group name [/var/log/syslog]:
Choose Log Stream name:
  1. Use EC2 instance id.
  2. Use hostname.
  3. Custom.
Enter choice [1]: 2
Choose Log Event timestamp format:
  1. %b %d %H:%M:%S    (Dec 31 23:59:59)
  2. %d/%b/%Y:%H:%M:%S (10/Oct/2000:13:55:36)
  3. %Y-%m-%d %H:%M:%S (2008-09-08 11:52:54)
  4. Custom
Enter choice [1]: 3
Choose initial position of upload:
  1. From start of file.
  2. From end of file.
Enter choice [1]: 1
More log files to configure? [Y]:

I continued by adding more log files such as apache access and error logs, and other types of logs.

You can start/stop/restart the CloudWatch Logs agent via:

# service awslogs start

The awslogs service writes its logs in /var/log/awslogs.log and its configuration file is in /var/awslogs/etc/awslogs.conf.
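For reference, the interactive answers above end up in stanzas of roughly this form in /var/awslogs/etc/awslogs.conf (treat this as a sketch of the file, not your exact contents):

```ini
[/var/log/syslog]
file = /var/log/syslog
log_group_name = /var/log/syslog
log_stream_name = {hostname}
datetime_format = %Y-%m-%d %H:%M:%S
initial_position = start_of_file
buffer_duration = 5000
```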

4) Create AWS ElasticSearch cluster

Not much to say here. Follow the prompts in the AWS console :)

For the initial Access Policy for the ES cluster, I chose an IP-based policy and specified the source CIDR blocks allowed to connect:

 "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:us-west-2:accountID:domain/my-es-cluster/*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": [
            "1.2.3.0/24",
            "4.5.6.7/32"
          ]
        }
      }
    }
  ]

5) Create subscription filters for streaming CloudWatch logs to ElasticSearch

First, make sure that the log files you configured with the AWS CloudWatch Log agent are indeed sent to CloudWatch. For each log file name, you should see a CloudWatch Log Group with that name, and inside the Log Group you should see multiple Log Streams, each Log Stream having the same name as the hostname sending those logs to CloudWatch.

I chose one of the Log Streams, went to Actions --> Stream to Amazon Elasticsearch Service, chose the ElasticSearch cluster created above, then created a new Lambda function to do the streaming. I had to create a new IAM role for the Lambda function. I created a role I called lambda-execution-role and associated with it the pre-existing IAM policy AWSLambdaBasicExecutionRole.

Once this Lambda function is created, subsequent log subscription filters for other Log Groups will reuse it for streaming to the same ES cluster.

One important note here is that you also need to allow the role lambda-execution-role to access the ES cluster. To do that, I modified the ES access policy and added a statement for the ARN of this role:

    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::accountID:role/lambda-execution-role"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:us-west-2:accountID:domain/my-es-cluster/*"
    }

6) Configure index pattern in Kibana

The last step is to configure Kibana to use the ElasticSearch index for the CloudWatch logs. If you look under Indices in the ElasticSearch dashboard, you should see indices of the form cwl-2017.11.24. In Kibana, add an Index Pattern of the form cwl-*. It should recognize the @timestamp field as the timestamp for the log entries, and create the Index Pattern correctly.

Now if you go to the Discover screen in Kibana, you should be able to visualize and search your log entries streamed from CloudWatch.
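One operational note: the subscription filter writes one daily cwl-YYYY.MM.DD index, so old indices pile up on a small ES cluster. A minimal sketch of the retention date arithmetic (the actual DELETE call against the ES API is not shown):

```python
from datetime import date, timedelta

# Hedged sketch: given the daily cwl-YYYY.MM.DD naming used by the
# CloudWatch-to-ES streaming Lambda, compute which indices fall outside
# a retention window so they can be deleted.
def expired_indices(today, keep_days, index_names):
    cutoff = today - timedelta(days=keep_days)
    def index_date(name):
        y, m, d = name.split("-", 1)[1].split(".")
        return date(int(y), int(m), int(d))
    return [n for n in index_names if index_date(n) < cutoff]

print(expired_indices(date(2017, 11, 24), 7,
                      ["cwl-2017.11.01", "cwl-2017.11.20"]))  # → ['cwl-2017.11.01']
```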

Monday, July 31, 2017

Apache 2.4 authentication and whitelisting scenarios

I have these examples scattered among many Apache installations, so I wanted to gather my notes here for my benefit, and hopefully for others as well. The following scenarios depict various requirements for Apache 2.4 authentication and whitelisting. They are all for Apache 2.4.x running on Ubuntu 14.04/16.04.

Scenario 1: block all access to Apache except to a list of whitelisted IP addresses and networks

Apache configuration snippet:

  <Directory /var/www/html/>
     IncludeOptional /etc/apache2/whitelist.conf
     Order allow,deny
     Allow from all
  </Directory>

Contents of whitelist.conf file:

# local server IPs
Require ip 127.0.0.1
Require ip 172.31.2.2

# Office network
Require ip 1.2.3.0/24

# Other IP addresses
Require ip 4.5.6.7/32
Require ip 5.6.7.8/32
etc.

Scenario 2: enable basic HTTP authentication but allow specific IP addresses through with no authentication

Apache configuration snippet:

  <Directory /var/www/html/>
     AuthType basic
     AuthBasicProvider file
     AuthName "Restricted Content"
     AuthUserFile /etc/apache2/.htpasswd

     Require valid-user
     IncludeOptional /etc/apache2/whitelist.conf
     Satisfy Any
  </Directory>

The contents of whitelist.conf are similar to the ones in Scenario 1.

Scenario 3: enable basic HTTP authentication but allow access to specific URLs with no authentication

Apache configuration snippet:

  <Directory /var/www/html/>
     Order allow,deny
     Allow from all

     AuthType Basic
     AuthName "Restricted Content"
     AuthUserFile /etc/apache2/.htpasswd

     SetEnvIf Request_URI /.well-known/acme-challenge/*  noauth=1
     <RequireAny>
       Require env noauth
       Require valid-user
     </RequireAny>
  </Directory>

This is useful when you install SSL certificates from Let's Encrypt and you need to allow the Let's Encrypt servers access to the HTTP challenge directory.

Thursday, June 01, 2017

SSL termination and http caching with HAProxy, Varnish and Apache


A common requirement when setting up a development or staging server is to try to mimic production as much as possible. One scenario I've implemented a few times is to use Varnish in front of a web site but also use SSL. Since Varnish can't handle encrypted traffic, SSL needs to be terminated before it hits Varnish. One fairly easy way to do it is using HAProxy to terminate both HTTP and HTTPS traffic, then forwarding the unencrypted traffic to Varnish, which then forwards non-cached traffic to Apache or nginx. Here are the steps to achieve this on an Ubuntu 16.04 box.

1) Install HAProxy and Varnish

# apt-get install haproxy varnish


2) Get SSL certificates from Let’s Encrypt

# wget https://dl.eff.org/certbot-auto
# chmod +x certbot-auto
# ./certbot-auto -a webroot --webroot-path=/var/www/mysite.com -d mysite.com certonly

3) Generate combined chain + key PEM file to be used by HAProxy

# cat /etc/letsencrypt/live/mysite.com/fullchain.pem /etc/letsencrypt/live/mysite.com/privkey.pem > /etc/ssl/private/mysite.com.pem
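HAProxy expects the certificate chain to come before the private key in the combined file, which is the order the cat command above produces. A quick hedged sanity check (plain string scanning, not real PEM parsing):

```python
# Hedged sketch: verify that a combined HAProxy PEM has the certificate
# chain first and the private key last, matching the cat order above.
def pem_order_ok(pem_text):
    begins = [line for line in pem_text.splitlines()
              if line.startswith("-----BEGIN")]
    return (bool(begins)
            and "PRIVATE KEY" in begins[-1]
            and all("CERTIFICATE" in b for b in begins[:-1]))

good = ("-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n"
        "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n")
print(pem_order_ok(good))  # → True
```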

4) Configure HAProxy

Edit haproxy.cfg and add frontend sections for ports 80 and 443 + backend section pointing to varnish on port 8888

# cat /etc/haproxy/haproxy.cfg
global
        log /dev/log    local0
        log /dev/log    local1 notice
        chroot /var/lib/haproxy
        stats socket /run/haproxy/admin.sock mode 660 level admin
        stats timeout 30s
        user haproxy
        group haproxy
        daemon

        # Default SSL material locations
        ca-base /etc/ssl/certs
        crt-base /etc/ssl/private

        # Default ciphers to use on SSL-enabled listening sockets.
        # For more information, see ciphers(1SSL). This list is from:
        #  https://hynek.me/articles/hardening-your-web-servers-ssl-ciphers/
        ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!DSS
        ssl-default-bind-options no-sslv3
        tune.ssl.default-dh-param 2048

defaults
        log     global
        mode    http
        option  httplog
        option  dontlognull
        timeout connect 5000
        timeout client  50000
        timeout server  50000
        errorfile 400 /etc/haproxy/errors/400.http
        errorfile 403 /etc/haproxy/errors/403.http
        errorfile 408 /etc/haproxy/errors/408.http
        errorfile 500 /etc/haproxy/errors/500.http
        errorfile 502 /etc/haproxy/errors/502.http
        errorfile 503 /etc/haproxy/errors/503.http
        errorfile 504 /etc/haproxy/errors/504.http

frontend www-http
   bind 172.31.8.204:80
   http-request set-header "SSL-OFFLOADED" "1"
   reqadd X-Forwarded-Proto:\ http
   default_backend varnish-backend

frontend www-https
   bind 172.31.8.204:443 ssl crt mysite.com.pem
   http-request set-header "SSL-OFFLOADED" "1"
   reqadd X-Forwarded-Proto:\ https
   default_backend varnish-backend

backend varnish-backend
   redirect scheme https if !{ ssl_fc }
   server varnish 172.31.8.204:8888 check

Enable UDP in rsyslog for haproxy logging by uncommenting 2 lines in /etc/rsyslog.conf:

# provides UDP syslog reception
module(load="imudp")
input(type="imudp" port="514")

Restart rsyslog and haproxy

# service rsyslog restart
# service haproxy restart

5) Configure varnish to listen on port 8888

Ubuntu 16.04 uses systemd for service management. You need to edit 2 files to configure the port Varnish will listen on:

/lib/systemd/system/varnish.service
/etc/default/varnish

In both, set the port after the -a flag to 8888, then stop the varnish service, reload the systemd daemon and restart the varnish service:

# systemctl stop varnish.service
# systemctl daemon-reload
# systemctl start varnish.service

By default, Varnish will send non-cached traffic to port 8080 on localhost.

6) Configure Apache or nginx to listen on 8080

For Apache, change port 80 to 8080 in all virtual hosts, and also change 80 to 8080 in /etc/apache2/ports.conf.
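If you have many virtual hosts, the edit can be scripted. A hedged sketch of the substitution (it only handles the plain "Listen 80" and "<VirtualHost ...:80>" forms; review the output before writing it back to /etc/apache2):

```python
import re

# Hedged sketch: bump port 80 to 8080 in Apache config text, covering the
# Listen directive and VirtualHost declarations shown in this post.
def move_to_8080(conf):
    conf = re.sub(r"(?m)^(\s*Listen\s+)80\b", r"\g<1>8080", conf)
    conf = re.sub(r"<VirtualHost\s+([^:>]+):80>", r"<VirtualHost \1:8080>", conf)
    return conf

print(move_to_8080("Listen 80\n<VirtualHost *:80>"))
```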




Thursday, March 30, 2017

Working with AWS CodeDeploy

As usual when I make a breakthrough after bumping my head against the wall for a few days trying to get something to work, I hasten to write down my notes here so I can remember what I've done ;) In this case, the head-against-the-wall routine was caused by trying to get AWS CodeDeploy to work within the regular code deployment procedures that we have in place using Jenkins and Capistrano.

Here is the 30,000 foot view of how the deployment process works using a combination of Jenkins, Docker, Capistrano and AWS CodeDeploy:
  1. Code gets pushed to GitHub
  2. Jenkins deployment job fires off either automatically (for development environments, if so desired) or manually
    • Jenkins spins up a Docker container running Capistrano and passes it several environment variables such as GitHub repository URL and branch, target deployment directory, etc.
    • The Capistrano Docker image is built beforehand and contains rake files that specify how the code it checks out from GitHub is supposed to be built
    • The Capistrano Docker container builds the code and exposes the target deployment directory as a Docker volume
    • Jenkins archives the files from the exposed Docker volume locally as a tar.gz file
    • Jenkins uploads the tar.gz to an S3 bucket
    • For good measure, Jenkins also builds a Docker image of a webapp container which includes the built artifacts, tags the image and pushes it to Amazon ECR so it can be later used if needed by an orchestration system such as Kubernetes
  3. AWS CodeDeploy runs a code deployment (via the AWS console currently, using the awscli soon) while specifying the S3 bucket and the tar.gz file above as the source of the deployment and an AWS AutoScaling group as the destination of the deployment
  4. Everybody is happy 
You may ask: why Capistrano? Why not use a shell script or some other way of building the source code into artifacts? Several reasons:
  • Capistrano is still one of the most popular deployment tools. Many developers are familiar with it.
  • You get many good features for free just by using Capistrano. For example, it automatically creates a releases directory under your target directory, creates a timestamped subdirectory under releases where it checks out the source code, builds the source code, and if everything works well creates a 'current' symlink pointing to the releases/timestamped subdirectory
  • This strategy is portable. Instead of building the code locally and uploading it to S3 for use with AWS CodeDeploy, you can use the regular Capistrano deployment and build the code directly on a target server via ssh. The rake files are the same, only the deploy configuration differs.
I am not going to go into details for the Jenkins/Capistrano/Docker setup. I've touched on some of these topics in previous posts.

I will go into details for the AWS CodeDeploy setup. Here goes.

Create IAM policies and roles

There are two roles that need to be created for AWS CodeDeploy to work. One is to be attached to EC2 instances that you want to deploy to, and one is to be used by the CodeDeploy agent running on each instance.

- Create the following IAM policy for EC2 instances, which allows those instances to list S3 buckets and download objects from S3 buckets (in this case the permissions cover all S3 buckets, but you can restrict them via the Resource element):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}


- Attach above policy to an IAM role and name the role e.g. CodeDeploy-EC2-Instance-Profile

- Create following IAM policy to be used by the CodeDeploy agent running on the EC2 instances you want to deploy to:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:CompleteLifecycleAction",
        "autoscaling:DeleteLifecycleHook",
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeLifecycleHooks",
        "autoscaling:PutLifecycleHook",
        "autoscaling:RecordLifecycleActionHeartbeat",
        "autoscaling:CreateAutoScalingGroup",
        "autoscaling:UpdateAutoScalingGroup",
        "autoscaling:EnableMetricsCollection",
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribePolicies",
        "autoscaling:DescribeScheduledActions",
        "autoscaling:DescribeNotificationConfigurations",
        "autoscaling:DescribeLifecycleHooks",
        "autoscaling:SuspendProcesses",
        "autoscaling:ResumeProcesses",
        "autoscaling:AttachLoadBalancers",
        "autoscaling:PutScalingPolicy",
        "autoscaling:PutScheduledUpdateGroupAction",
        "autoscaling:PutNotificationConfiguration",
        "autoscaling:PutLifecycleHook",
        "autoscaling:DescribeScalingActivities",
        "autoscaling:DeleteAutoScalingGroup",
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceStatus",
        "ec2:TerminateInstances",
        "tag:GetTags",
        "tag:GetResources",
        "sns:Publish",
        "cloudwatch:DescribeAlarms",
        "elasticloadbalancing:DescribeLoadBalancers",
        "elasticloadbalancing:DescribeInstanceHealth",
        "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
        "elasticloadbalancing:DeregisterInstancesFromLoadBalancer"
      ],
      "Resource": "*"
    }
  ]
} 
- Attach above policy to an IAM role and name the role e.g. CodeDeployServiceRole

Create a 'golden image' AMI

The whole purpose of AWS CodeDeploy is to act in conjunction with Auto Scaling Groups so that the app server layer of your infrastructure becomes horizontally scalable. You need to start somewhere, so I recommend the following:
  • set up an EC2 instance for your app server the old-fashioned way, either with Ansible/Chef/Puppet or with Terraform
  • configure this EC2 instance to talk to any other layers it needs, i.e. the database layer (either running on EC2 instances or, if you are in AWS, on RDS), the caching layer (dedicated EC2 instances running Redis/memcached, or AWS ElastiCache), etc. 
  •  deploy some version of your code to the instance and make sure your application is fully functioning
 If all this works as expected, take an AMI image from this EC2 instance. This image will serve as the 'golden image' that all other instances launched by the Auto Scaling Group / Launch Configuration will be based on.

Create Application Load Balancer (ALB) and Target Group

The ALB will be the entry point into your infrastructure. For now just create an ALB and an associated Target Group. Make sure you add your availability zones into the AZ pool of the ALB.

If you want the ALB to handle the SSL certificate for your domain, add the SSL cert to Amazon Certificate Manager and add a listener on the ALB mapping port 443 to the Target Group. Of course, also add a listener for port 80 on the ALB and map it to the Target Group.

I recommend creating a dedicated Security Group for the ALB and allowing ports 80 and 443, either from everywhere or from a restricted subnet if you want to test it first.

For the Target Group, make sure you set the correct health check for your application (something like requesting a special file healthcheck.html over port 80). No need to select any EC2 instances in the Target Group yet.

Create Launch Configuration and Auto Scaling Group

Here are the main elements to keep in mind when creating a Launch Configuration to be used in conjunction with AWS CodeDeploy:
  • AMI ID: specify the AMI ID of the 'golden image' created above
  • IAM Instance Profile: specify CodeDeploy-EC2-Instance-Profile (role created above)
  • Security Groups: create a Security Group that allows access to ports 80 and 443 from the ALB Security Group above 
  • User data: each newly launched EC2 instance based on your golden image AMI will have to get the AWS CodeDeploy agent installed. Here's the user data for an Ubuntu-based AMI (taken from the AWS CodeDeploy documentation):
#!/bin/bash
apt-get -y update
apt-get -y install awscli
apt-get -y install ruby
cd /home/ubuntu
aws s3 cp s3://aws-codedeploy-us-west-2/latest/install . --region us-west-2
chmod +x ./install
./install auto

Alternatively, you can run these commands on your initial EC2 instance, then take a golden image AMI based off of that instance. That way you make sure that the CodeDeploy agent will be running on each new EC2 instance that gets provisioned via the Launch Configuration. In this case, there is no need to specify a User data section for the Launch Configuration.

Once the Launch Configuration is created, you'll be able to create an Auto Scaling Group (ASG) associated with it. Here are the main configuration elements for the ASG:
  • Launch Configuration: the one defined above
  • Target Groups: the Target Group defined above
  • Min/Max/Desired: up to you to define the EC2 instance count for each of these. You can start with 1/1/1 to test
  • Scaling Policies: you can start with a static policy (corresponding to Min/Max/Desired counts) and add policies based on alarms triggered by various Cloud Watch metrics such as CPU usage, memory usage, etc as measured on the EC2 instances comprising the ASG
Once the ASG is created, depending on the Desired instance count, that many EC2 instances will be launched.

 Create AWS CodeDeploy Application and Deployment Group

We finally get to the meat of this post. Go to the AWS CodeDeploy page and create a new Application. You also need to create a Deployment Group while you are at it. For Deployment Type, you can start with 'In-place deployment' and once you are happy with that, move to 'Blue/green deployment', which is more complex but better from a high-availability and rollback perspective.

In the Add Instances section, choose 'Auto scaling group' as the tag type, and the name of the ASG created above as the key. Under 'Total matching instances' below the Tag and Key you should see a number of EC2 instances corresponding to the Desired count in your ASG.

For Deployment Configuration, you can start with the default value, which is OneAtATime, then experiment with other types such as HalfAtATime (I don't recommend AllAtOnce unless you know what you're doing).

For Service Role, you need to specify the CodeDeployServiceRole service role created above.

Create scaffolding files for AWS CodeDeploy Application lifecycle

At a minimum, the tar.gz or zip archive of your application's built code also needs to contain what is called an AppSpec file, which is a YAML file named appspec.yml. The file needs to be in the root directory of the archive. Here's what I have in mine:

version: 0.0
os: linux
files:
  - source: /
    destination: /var/www/mydomain.com/
hooks:
  BeforeInstall:
    - location: before_install
      timeout: 300
      runas: root
  AfterInstall:
    - location: after_install
      timeout: 300
      runas: root


The before_install and after_install scripts (you can name them anything you want) are shell scripts that will be executed after the archive is downloaded on the target EC2 instance.

The before_install script will be run before the files inside the archive are copied into the destination directory (as specified in the destination variable /var/www/mydomain.com). You can do things like create certain directories that need to exist, or change the ownership/permissions of certain files and directories.

The after_install script will be run after the files inside the archive are copied into the destination directory. You can do things like create symlinks, run any scripts that need to complete the application installation (such as scripts that need to hit the database), etc.

One note specific to archives obtained from code built by Capistrano: it's customary to have Capistrano tasks create symlinks for directories such as media or var to volumes outside of the web server document root (when media files are mounted over NFS/EFS for example). When these symlinks are unarchived by CodeDeploy, they tend to turn into regular directories, and the contents of potentially large mounted file systems get copied in them. Not what you want. I ended up creating all symlinks I need in the after_install script, and not creating them in Capistrano.
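A minimal sketch of that idea, in Python for illustration (CodeDeploy hooks are usually shell scripts, and the paths here are invented): replace whatever the unpack step left behind with a symlink to the shared mount.

```python
import os
import tempfile

# Hedged sketch (invented paths): recreate a shared-storage symlink from the
# after_install hook instead of shipping the symlink inside the archive,
# since CodeDeploy tends to unpack it as a regular directory.
def replace_with_symlink(link_path, target):
    if os.path.islink(link_path):
        os.remove(link_path)       # stale link from a previous deploy
    elif os.path.isdir(link_path):
        os.rmdir(link_path)        # empty directory created during unpack
    os.symlink(target, link_path)

root = tempfile.mkdtemp()
shared = os.path.join(root, "shared_media")
os.mkdir(shared)
link = os.path.join(root, "media")
os.mkdir(link)                     # simulate the directory CodeDeploy created
replace_with_symlink(link, shared)
print(os.path.islink(link))  # → True
```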

There are other points in the Application deploy lifecycle where you can insert your own scripts. See the AppSpec hook documentation.


Deploy the application with AWS CodeDeploy

Once you have an Application and its associated Deployment Group, you can select this group and choose 'Deploy new revision' from the Action drop-down. For the Revision Type, choose 'My application is stored in Amazon S3'. For the Revision Location, type in the name of the S3 bucket where Jenkins uploaded the tar.gz of the application build. You can play with the other options according to the needs of your deployment.

Finally, hit the Deploy button, baby! If everything goes well, you'll see a nice green bar showing success.


If everything does not go well, you can usually troubleshoot things pretty well by looking at the logs of the Events associated with that particular Deployment. Here's an example of an error log:

ScriptFailed
Script Name: after_install
Message: Script at specified location: after_install run as user root failed with exit code 1

Log Tail: [stderr] chown: changing ownership of ‘/var/www/mydomain.com/shared/media/images/85.jpg’: Operation not permitted

 
In this case, the 'shared' directory was mounted over NFS, so I had to make sure the permissions and ownership of the source file system on the NFS server were correct.

I am still experimenting with AWS CodeDeploy and haven't quite used it 'in anger' yet, so I'll report back with any other findings.

Tuesday, January 31, 2017

Notes on setting up Elasticsearch, Kibana and Fluentd on Ubuntu

I've been experimenting with an EFK stack (with Fluentd replacing Logstash) and I hasten to write down some of my notes. I could have just as well used Logstash, but my goal is to also use the EFK stack for capturing logs out of Kubernetes clusters, and I wanted to become familiar with Fluentd, which is a Cloud Native Computing Foundation project.

1) Install Java 8

On Ubuntu 16.04:

# apt-get install openjdk-8-jre-headless

On Ubuntu 14.04:

# add-apt-repository -y ppa:webupd8team/java
# apt-get update
# apt-get -y install oracle-java8-installer

2) Download and install Elasticsearch (latest version is 5.1.2 currently)

# wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.1.2.deb
# dpkg -i elasticsearch-5.1.2.deb


Edit /etc/elasticsearch/elasticsearch.yml and set

network.host: 0.0.0.0

# service elasticsearch restart

3) Download and install Kibana

# wget https://artifacts.elastic.co/downloads/kibana/kibana-5.1.2-amd64.deb
# dpkg -i kibana-5.1.2-amd64.deb


Edit /etc/kibana/kibana.yml and set

server.host: "local_ip_address"


# service kibana restart

4) Install Fluentd agent (td-agent)

On Ubuntu 16.04:

# curl -L https://toolbelt.treasuredata.com/sh/install-ubuntu-xenial-td-agent2.sh | sh

On Ubuntu 14.04:

# curl -L https://toolbelt.treasuredata.com/sh/install-ubuntu-trusty-td-agent2.sh | sh


Install Fluentd elasticsearch plugin (note that td-agent comes with its own gem installer):

# td-agent-gem install fluent-plugin-elasticsearch

5) Configure Fluentd agent

To specify the Elasticsearch server to send the local logs to, use a match stanza in /etc/td-agent/td-agent.conf:

<match **>
  @type elasticsearch
  logstash_format true
  host IP_ADDRESS_OF_ELASTICSEARCH_SERVER
  port 9200
  index_name fluentd
  type_name fluentd.project.stage.web01
</match>

Note that Fluentd's elasticsearch plugin is compatible with the Logstash index naming, so if you set logstash_format true, Elasticsearch will create daily indices named logstash-*. Also, port 9200 needs to be open from the client to the Elasticsearch server.

I found it useful to set the type_name property to a name specific to the client running the Fluentd agent. For example, if you have several projects/tenants, each with multiple environments (dev, stage, prod) and each environment with multiple servers, you could use something like type_name fluentd.project.stage.web01. This label will then be parsed and shown in Kibana and will allow you to easily tell the source of a given log entry.
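If you later need to split that label back into its components (in a small reporting script, for example), it is plain string parsing; a hypothetical helper:

```python
# Hypothetical helper mirroring the fluentd.<project>.<stage>.<host>
# naming convention described above.
def parse_type_name(type_name):
    _, project, stage, host = type_name.split(".")
    return {"project": project, "stage": stage, "host": host}

print(parse_type_name("fluentd.project.stage.web01"))
```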

If you want Fluentd to parse Apache logs and send the log entries to Elasticsearch, use stanzas of this form in td-agent.conf:

<source>
  type tail
  format apache2
  path /var/log/apache2/mysite.com-access.log
  pos_file /var/log/td-agent/mysite.com-access.pos
  tag apache.access
</source>

<source>
  type tail
  format apache2
  path /var/log/apache2/mysite.com-ssl-access.log
  pos_file /var/log/td-agent/mysite.com-ssl-access.pos
  tag apache.ssl.access
</source>

For syslog logs, use:

<source>
  @type syslog
  port 5140
  bind 0.0.0.0
  tag system.local
</source>
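Note that the syslog source only listens on UDP port 5140; something still has to forward syslog traffic there. With rsyslog, that is typically a one-line forwarding rule (the snippet file name below is my choice; any file under /etc/rsyslog.d/ works):

```
# /etc/rsyslog.d/50-fluentd.conf
*.* @127.0.0.1:5140
```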

Restart td-agent:

# service td-agent restart

Inspect the td-agent log file:

# tail -f /var/log/td-agent/td-agent.log

Some things I've had to do to fix errors emitted by td-agent:
  • change permissions on apache log directory and log files so they are readable by user td-agent
  • make sure port 9200 is open from the client to the Elasticsearch server

That's it in a nutshell. In the next installment, I'll show how to secure the communication between the Fluentd agent and the Elasticsearch server.

Tuesday, December 06, 2016

Using Helm to install Traefik as an Ingress Controller in Kubernetes

That was a mouthful of a title...Hope this post lives up to it :)

First of all, just a bit of theory. If you want to expose your application running on Kubernetes to the outside world, you have several choices.

One choice you have is to expose the pods running your application via a Service of type NodePort or LoadBalancer. If you run your service as a NodePort, Kubernetes will allocate a random high port on every node in the cluster, and it will proxy traffic to that port to your service. Services of type LoadBalancer are only supported if you run your Kubernetes cluster using certain specific cloud providers such as AWS and GCE. In this case, the cloud provider will create a specific load balancer resource, for example an Elastic Load Balancer in AWS, which will then forward traffic to the pods comprising your service. Either way, the load balancing you get by exposing a service is fairly crude, at the TCP layer and using a round-robin algorithm.

A better choice for exposing your Kubernetes application is to use Ingress resources together with Ingress Controllers. An ingress resource is a fancy name for a set of layer 7 load balancing rules, as you might be familiar with if you use HAProxy or Pound as a software load balancer. An Ingress Controller is a piece of software that actually implements those rules by watching the Kubernetes API for requests to Ingress resources. Here is a fragment from the Ingress Controller documentation on GitHub:

What is an Ingress Controller?

An Ingress Controller is a daemon, deployed as a Kubernetes Pod, that watches the ApiServer's /ingresses endpoint for updates to the Ingress resource. Its job is to satisfy requests for ingress.
Writing an Ingress Controller

Writing an Ingress controller is simple. By way of example, the nginx controller does the following:
  • Poll until apiserver reports a new Ingress
  • Write the nginx config file based on a go text/template
  • Reload nginx
As I mentioned in a previous post, I warmly recommend watching a KubeCon presentation from Gerred Dillon on "Kubernetes Ingress: Your Router, Your Rules" if you want to further delve into the advantages of using Ingress Controllers as opposed to plain Services.
While nginx is the only software currently included in the Kubernetes source code as an Ingress Controller, I wanted to experiment with a full-fledged HTTP reverse proxy such as Traefik. I should add from the beginning that only nginx offers the TLS feature of Ingress resources. Traefik can terminate SSL of course, and I'll show how you can do that, but it is outside of the Ingress resource spec.

I've also been looking at Helm, the Kubernetes package manager, and I noticed that Traefik is one of the 'stable' packages (or Charts as they are called) currently offered by Helm, so I went the Helm route in order to install Traefik. In the following instructions I will assume that you are already running a Kubernetes cluster in AWS and that your local kubectl environment is configured to talk to that cluster.

Install Helm

This is pretty easy. Follow the instructions on GitHub to download or install a binary for your OS.

Initialize Helm

Run helm init in order to install the server component of Helm, called tiller, which will be run as a Kubernetes Deployment in the kube-system namespace of your cluster.

Get the Traefik Helm chart from GitHub

I git cloned the entire kubernetes/charts repo, then copied the traefik directory locally under my own source code repo which contains the rest of the yaml files for my Kubernetes resource manifests.

# git clone https://github.com/kubernetes/charts.git helmcharts
# cp -r helmcharts/stable/traefik traefik-helm-chart

It is instructive to look at the contents of a Helm chart. The main advantage of a chart in my view is the bundling together of all the Kubernetes resources necessary to run a specific set of services. The other advantage is that you can use Go-style templates for the resource manifests, and the variables in those template files can be passed to helm via a values.yaml file or via the command line.

For more details on Helm charts and templates, I recommend this linux.com article.
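For reference, a stable chart such as traefik typically has a layout along these lines (exact file names vary by chart version, so treat this as a sketch):

```
traefik-helm-chart/
├── Chart.yaml          # chart metadata: name, version, description
├── values.yaml         # default values consumed by the templates
└── templates/          # Go-templated Kubernetes resource manifests
    ├── deployment.yaml
    ├── service.yaml
    └── dashboard-ingress.yaml
```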

Create an Ingress resource for your application service

I copied the dashboard-ingress.yaml template file from the Traefik chart and customized it so as to refer to my application's web service, which is running in a Kubernetes namespace called tenant1.

# cd traefik-helm-chart/templates
# cp dashboard-ingress.yaml web-ingress.yaml
# cat web-ingress.yaml
{{- if .Values.tenant1.enabled }}
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
 namespace: {{ .Values.tenant1.namespace }}
 name: {{ template "fullname" . }}-web-ingress
 labels:
   app: {{ template "fullname" . }}
   chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
   release: "{{ .Release.Name }}"
   heritage: "{{ .Release.Service }}"
spec:
 rules:
 - host: {{ .Values.tenant1.domain }}
   http:
     paths:
     - path: /
       backend:
         serviceName: {{ .Values.tenant1.serviceName }}
         servicePort: {{ .Values.tenant1.servicePort }}
{{- end }}

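To make the Go templating concrete: with the tenant1 values I define in values.yaml further down, plus the release name tenant1-lb and the traefik chart name, the template above should render to a plain Ingress manifest roughly like this:

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  namespace: tenant1
  name: tenant1-lb-traefik-web-ingress
  labels:
    app: tenant1-lb-traefik
    chart: "traefik-1.1.0-a"
    release: "tenant1-lb"
    heritage: "Tiller"
spec:
  rules:
  - host: tenant1.dev.mydomain.com
    http:
      paths:
      - path: /
        backend:
          serviceName: web
          servicePort: http
```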
The variables referenced in the template above are defined in the values.yaml file in the Helm chart. I started with the variables in the values.yaml file that came with the Traefik chart and added my own customizations:

# vi traefik-helm-chart/values.yaml
ssl:
 enabled: true
acme:
 enabled: true
 email: admin@mydomain.com
 staging: false
 # Save ACME certs to a persistent volume. WARNING: If you do not do this, you will re-request
 # certs every time a pod (re-)starts and you WILL be rate limited!
 persistence:
   enabled: true
   storageClass: kubernetes.io/aws-ebs
   accessMode: ReadWriteOnce
   size: 1Gi
dashboard:
 enabled: true
 domain: tenant1-lb.dev.mydomain.com
gzip:
 enabled: false
tenant1:
 enabled: true
 namespace: tenant1
 domain: tenant1.dev.mydomain.com
 serviceName: web
 servicePort: http

Note that I added a section called tenant1, where I defined the variables referenced in the web-ingress.yaml template above. I also enabled the ssl and acme sections, so that Traefik can automatically install SSL certificates from Let's Encrypt via the ACME protocol.
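Nothing in this layout is specific to a single tenant. To route a second application you could add a parallel values section (tenant2 below is hypothetical) and a second ingress template keyed off it, following the same pattern as web-ingress.yaml:

```yaml
tenant2:
 enabled: true
 namespace: tenant2
 domain: tenant2.dev.mydomain.com
 serviceName: web
 servicePort: http
```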

Install your customized Helm chart for Traefik

With these modifications done, I ran 'helm install' to actually deploy the various Kubernetes resources included in the Traefik chart. 

I specified the directory containing my Traefik chart files (traefik-helm-chart) as the last argument passed to helm install:

# helm install --name tenant1-lb --namespace tenant1 traefik-helm-chart/
NAME: tenant1-lb
LAST DEPLOYED: Tue Nov 29 09:51:12 2016
NAMESPACE: tenant1
STATUS: DEPLOYED

RESOURCES:
==> extensions/Ingress
NAME                                  HOSTS                    ADDRESS   PORTS     AGE
tenant1-lb-traefik-web-ingress   tenant1.dev.mydomain.com             80        1s
tenant1-lb-traefik-dashboard   tenant1-lb.dev.mydomain.com             80        0s

==> v1/PersistentVolumeClaim
NAME                    STATUS    VOLUME    CAPACITY   ACCESSMODES   AGE
tenant1-lb-traefik-acme   Pending                                      0s

==> v1/Secret
NAME                            TYPE      DATA      AGE
tenant1-lb-traefik-default-cert   Opaque    2         1s

==> v1/ConfigMap
NAME               DATA      AGE
tenant1-lb-traefik   1         1s

==> v1/Service
NAME                         CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
tenant1-lb-traefik-dashboard   10.3.0.15    <none>        80/TCP    1s
tenant1-lb-traefik   10.3.0.215   <pending>   80/TCP,443/TCP   1s

==> extensions/Deployment
NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
tenant1-lb-traefik   1         1         1            0           1s


NOTES:
1. Get Traefik's load balancer IP/hostname:

    NOTE: It may take a few minutes for this to become available.

    You can watch the status by running:

        $ kubectl get svc tenant1-lb-traefik --namespace tenant1 -w

    Once 'EXTERNAL-IP' is no longer '<pending>':

        $ kubectl describe svc tenant1-lb-traefik --namespace tenant1 | grep Ingress | awk '{print $3}'

2. Configure DNS records corresponding to Kubernetes ingress resources to point to the load balancer IP/hostname found in step 1

At this point you should see two Ingress resources, one for the Traefik dashboard and one for the custom web ingress resource:

# kubectl --namespace tenant1 get ingress
NAME                           HOSTS                       ADDRESS   PORTS     AGE
tenant1-lb-traefik-dashboard   tenant1-lb.dev.mydomain.com           80        50s
tenant1-lb-traefik-web-ingress tenant1.dev.mydomain.com            80        51s

As per the Helm notes above (shown as part of the output of helm install), run this command to figure out the CNAME of the AWS ELB created by Kubernetes during the creation of the tenant1-lb-traefik service of type LoadBalancer:

# kubectl describe svc tenant1-lb-traefik --namespace tenant1 | grep Ingress | awk '{print $3}'
a5be275d8b65c11e685a402e9ec69178-91587212.us-west-2.elb.amazonaws.com

Create tenant1.dev.mydomain.com and tenant1-lb.dev.mydomain.com as DNS CNAME records pointing to a5be275d8b65c11e685a402e9ec69178-91587212.us-west-2.elb.amazonaws.com.
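To see why that grep/awk pipeline works, here it is run against a sample line mimicking the relevant part of the kubectl describe output (the ELB hostname is the one from above): 'LoadBalancer Ingress:' occupies the first two awk fields, so the hostname is field 3.

```shell
# Sample 'kubectl describe svc' line; grep isolates it, awk prints field 3
sample="LoadBalancer Ingress:   a5be275d8b65c11e685a402e9ec69178-91587212.us-west-2.elb.amazonaws.com"
echo "$sample" | grep Ingress | awk '{print $3}'
# → a5be275d8b65c11e685a402e9ec69178-91587212.us-west-2.elb.amazonaws.com
```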

Now, if you hit http://tenant1-lb.dev.mydomain.com you should see the Traefik dashboard showing the frontends on the left and the backends on the right:

[Screenshot: Traefik dashboard, frontends on the left and backends on the right]
If you hit http://tenant1.dev.mydomain.com you should see your web service in action.

You can also inspect the logs of the tenant1-lb-traefik pod to see what's going on under the covers when Traefik is launched and to verify that the Let's Encrypt SSL certificates were properly downloaded via ACME:

# kubectl --namespace tenant1 logs tenant1-lb-traefik-3710322105-o2887
time="2016-11-29T00:03:51Z" level=info msg="Traefik version v1.1.0 built on 2016-11-18_09:20:46AM"
time="2016-11-29T00:03:51Z" level=info msg="Using TOML configuration file /config/traefik.toml"
time="2016-11-29T00:03:51Z" level=info msg="Preparing server http &{Network: Address::80 TLS:<nil> Redirect:<nil> Auth:<nil> Compress:false}"
time="2016-11-29T00:03:51Z" level=info msg="Preparing server https &{Network: Address::443 TLS:0xc4201b1800 Redirect:<nil> Auth:<nil> Compress:false}"
time="2016-11-29T00:03:51Z" level=info msg="Starting server on :80"
time="2016-11-29T00:03:58Z" level=info msg="Loading ACME Account..."
time="2016-11-29T00:03:59Z" level=info msg="Loaded ACME config from store /acme/acme.json"
time="2016-11-29T00:04:01Z" level=info msg="Starting provider *main.WebProvider {\"Address\":\":8080\",\"CertFile\":\"\",\"KeyFile\":\"\",\"ReadOnly\":false,\"Auth\":null}"
time="2016-11-29T00:04:01Z" level=info msg="Starting provider *provider.Kubernetes {\"Watch\":true,\"Filename\":\"\",\"Constraints\":[],\"Endpoint\":\"\",\"DisablePassHostHeaders\":false,\"Namespaces\":null,\"LabelSelector\":\"\"}"
time="2016-11-29T00:04:01Z" level=info msg="Retrieving ACME certificates..."
time="2016-11-29T00:04:01Z" level=info msg="Retrieved ACME certificates"
time="2016-11-29T00:04:01Z" level=info msg="Starting server on :443"
time="2016-11-29T00:04:01Z" level=info msg="Server configuration reloaded on :80"
time="2016-11-29T00:04:01Z" level=info msg="Server configuration reloaded on :443"

To get an even better warm and fuzzy feeling about the SSL certificates installed via ACME, you can run this command against the live endpoint tenant1.dev.mydomain.com:

# echo | openssl s_client -showcerts -servername tenant1.dev.mydomain.com -connect tenant1.dev.mydomain.com:443 2>/dev/null
CONNECTED(00000003)
---
Certificate chain
0 s:/CN=tenant1.dev.mydomain.com
  i:/C=US/O=Let's Encrypt/CN=Let's Encrypt Authority X3
-----BEGIN CERTIFICATE-----
MIIGEDCCBPigAwIBAgISAwNwBNVU7ZHlRtPxBBOPPVXkMA0GCSqGSIb3DQEBCwUA
-----END CERTIFICATE-----
1 s:/C=US/O=Let's Encrypt/CN=Let's Encrypt Authority X3
  i:/O=Digital Signature Trust Co./CN=DST Root CA X3
-----BEGIN CERTIFICATE-----
uM2VcGfl96S8TihRzZvoroed6ti6WqEBmtzw3Wodatg+VyOeph4EYpr/1wXKtx8/
KOqkqm57TH2H3eDJAkSnh6/DNFu0Qg==
-----END CERTIFICATE-----
---
Server certificate
subject=/CN=tenant1.dev.mydomain.com
issuer=/C=US/O=Let's Encrypt/CN=Let's Encrypt Authority X3
---
No client certificate CA names sent
---
SSL handshake has read 3009 bytes and written 713 bytes
---
New, TLSv1/SSLv3, Cipher is AES128-SHA
Server public key is 4096 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
SSL-Session:
   Protocol  : TLSv1
   Cipher    : AES128-SHA
   Start Time: 1480456552
   Timeout   : 300 (sec)
   Verify return code: 0 (ok)
etc.
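If you want to inspect the validity window of the served certificate, you can feed any PEM certificate block to openssl x509. The sketch below generates a throwaway self-signed certificate locally (so it is runnable anywhere); against the live endpoint you would instead pipe the s_client output into the same x509 command.

```shell
# Generate a throwaway self-signed cert to demonstrate the x509 inspection
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/key.pem -out /tmp/cert.pem \
  -days 90 -subj "/CN=tenant1.dev.mydomain.com" 2>/dev/null

# Print the subject and the notBefore/notAfter validity dates
openssl x509 -in /tmp/cert.pem -noout -subject -dates
```

For the live Let's Encrypt certificate, the equivalent one-liner is piping `echo | openssl s_client -connect tenant1.dev.mydomain.com:443` into `openssl x509 -noout -dates`; Let's Encrypt certificates are valid for 90 days, so this is a handy check that ACME renewal is working.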

Other helm commands

You can list the Helm releases that are currently running (a Helm release is a particular versioned instance of a Helm chart) with helm list:

# helm list
NAME        REVISION UPDATED                  STATUS   CHART
tenant1-lb    1        Tue Nov 29 10:13:47 2016 DEPLOYED traefik-1.1.0-a


If you change any files or values in a Helm chart, you can apply the changes by means of the 'helm upgrade' command:

# helm upgrade tenant1-lb traefik-helm-chart

You can see the status of a release with helm status:

# helm status tenant1-lb
LAST DEPLOYED: Tue Nov 29 10:13:47 2016
NAMESPACE: tenant1
STATUS: DEPLOYED

RESOURCES:
==> v1/Service
NAME               CLUSTER-IP   EXTERNAL-IP        PORT(S)          AGE
tenant1-lb-traefik   10.3.0.76    a92601b47b65f...   80/TCP,443/TCP   35m
tenant1-lb-traefik-dashboard   10.3.0.36   <none>    80/TCP    35m

==> extensions/Deployment
NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
tenant1-lb-traefik   1         1         1            1           35m

==> extensions/Ingress
NAME                                  HOSTS                    ADDRESS   PORTS     AGE
tenant1-lb-traefik-web-ingress   tenant1.dev.mydomain.com             80        35m
tenant1-lb-traefik-dashboard   tenant1-lb.dev.mydomain.com             80        35m

==> v1/PersistentVolumeClaim
NAME                    STATUS    VOLUME                                     CAPACITY   ACCESSMODES   AGE
tenant1-lb-traefik-acme   Bound     pvc-927df794-b65f-11e6-85a4-02e9ec69178b   1Gi        RWO           35m

==> v1/Secret
NAME                            TYPE      DATA      AGE
tenant1-lb-traefik-default-cert   Opaque    2         35m

==> v1/ConfigMap
NAME               DATA      AGE
tenant1-lb-traefik   1         35m




