Certified Kubernetes Application Developer CKAD Learning Path

Finally moving a bit away from the clouds (AWS and GCP), and as my involvement with Kubernetes grew, I decided to challenge myself with the Kubernetes certifications. I started with the Certified Kubernetes Application Developer and am happy to share that I cleared it in the first attempt with 84%.

  • CKAD is more of an open book test, where you have access to the official Kubernetes documentation during the exam, but it focuses more on hands-on experience.
  • CKAD focuses on using a Kubernetes cluster that has already been provisioned.
  • Unlike AWS and GCP certifications, you are required to solve and debug actual problems and provision resources on a live Kubernetes cluster.
  • It is surely one of the most challenging exams I have appeared for in recent times.
  • Even though it is an open book test, you need to know where the information is.
  • Trust me, if you are not prepared, the time is not going to be sufficient.

CKAD Exam Pattern and Tips

    • CKAD requires you to solve 19 questions in 2 hours.
    • CKAD exam curriculum includes these general domains and their weights on the exam:
      • 13% – Core Concepts
      • 18% – Configuration
      • 10% – Multi-Container Pods
      • 18% – Observability
      • 20% – Pod Design
      • 13% – Services & Networking
      • 8% – State Persistence
    • Exam questions can be attempted in any order and don’t have to be solved sequentially.
    • Each question carries a weight, so be sure you attempt the questions with higher weights before focusing on the lower ones. Target the ones with higher weights and quicker solutions, like the debugging ones.
    • 4 different K8s clusters are provisioned. Each question refers to a different Kubernetes cluster, and the context needs to be switched, so be sure to execute the kubectl config use-context command. This command is provided with every question and you just need to copy-paste it.
    • Check the namespace mentioned in the question to find and create resources. Use the -n <namespace> flag.
    • You would be performing most of the interaction from the base node. However, pay attention to the node on which you need to execute the commands and make sure you return to the base node.
    • SSH to nodes and gaining root access is allowed, if needed. Commands are provided.
    • Read carefully the information provided within the questions; it gives very useful hints for addressing the question and saves time, e.g. namespaces to look into, or, for a failed pod, what has already been created (ConfigMaps, Secrets, Network Policies) so that you do not create the same again.
    • Make sure you know the imperative commands to create resources, as you won’t have time to create and edit YAML files. kubectl run with the --restart flag is your saviour.

kubectl commands

    • If you need to edit further, use --dry-run -o yaml (--dry-run=client -o yaml on newer kubectl versions) to get a head-start YAML file and edit the same.

CKAD Learning Path

CKAD Key Topics

General information and practices

  • You can book the exam from CNCF CKAD Certification @ $300. Usually you would get discount coupons of 15-20% on the exam.
  • Exam can be taken online from anywhere.
  • Make sure you have prepared your workspace well before the exams.
  • Make sure you have a valid government issued ID card as it would be checked.
  • You are not allowed to have anything around you and no one should enter the room.
  • Exam proctor will be watching you always, so refrain from doing any other activities. Your screen is also always shared.
  • I did not have any warnings from the proctor, except for a request to keep the camera focused.
  • You would need to install a Google Chrome plugin, and the exam provides a web-based shell to work on, which worked quite well without any glitches. Copy + paste works fine.
  • You will have an online notepad in the right corner for taking notes. I hardly used it, but it can be good for typing and modifying text instead of using the vi editor.

 

AWS Certified Machine Learning – Specialty (MLS-C01) Exam Learning Path

AWS Certified Machine Learning Specialty Certification

Finally, cleared the AWS Certified Machine Learning – Specialty (MLS-C01). It took me around four months to prepare for the exam. This was my fourth Specialty certification, and of all of them it is the toughest in terms of difficulty, partly because I am not a machine learning expert and learned everything from the basics for this certification. Machine Learning is a vast specialization in itself, and with AWS services there is a lot to cover and know for the exam. This is the only exam where the majority of the focus is on concepts outside of AWS, i.e. pure machine learning. It also includes AWS Machine Learning and Big Data services.

AWS Certified Machine Learning – Specialty (MLS-C01) exam basically validates

  •  Select and justify the appropriate ML approach for a given business problem.
  • Identify appropriate AWS services to implement ML solutions.
  • Design and implement scalable, cost-optimized, reliable, and secure ML solutions.

Refer AWS Certified Machine Learning – Specialty Exam Guide for details

                              AWS Certified Machine Learning – Specialty Domains

AWS Certified Machine Learning – Specialty (MLS-C01) Exam Summary

  • AWS Certified Machine Learning – Specialty exam, as its name suggests, covers a lot of Machine Learning concepts. It really digs deep into Machine Learning concepts, most of which are not related to AWS.
  • AWS Certified Machine Learning – Specialty exam covers the end-to-end Machine Learning lifecycle, right from data collection, transformation, and pre-processing to make it usable and efficient for Machine Learning, to training, validation, and implementation.
  • As always, one of the key tactics I followed when solving any AWS Certification exam is to read the question, use paper and pencil to draw a rough architecture, and focus on the areas that need attention. Trust me, you will be able to eliminate 2 answers for sure and then need to focus on only the other two. Read the other 2 answers to check where they differ; that would help you reach the right answer or at least have a 50% chance of getting it right.

Preparation Summary

  • Machine Learning
    • Make sure you know and cover all the topics in depth, as 60% of the exam is focused on generic Machine Learning concepts not related to AWS services.
    • Know the complete generic Machine Learning lifecycle
    • Exploratory Data Analysis
      • Feature selection and Engineering
        • remove features which are not related to training
        • remove features which have the same values, very low correlation, very little variance, or a lot of missing values
        • Apply techniques like Principal Component Analysis (PCA) for dimensionality reduction i.e reduce the number of features.
        • Apply techniques such as One-hot encoding and label encoding to help convert strings to numeric values, which are easier to process.
        • Apply Normalization i.e. values between 0 and 1 to handle data with large variance.
        • Apply feature engineering for feature reduction, e.g. using a single height/weight feature instead of both features
      • Handle Missing data
        • remove the feature or rows with missing data
        • impute using Mean/Median values – valid only for numeric values and not categorical features; also does not factor in correlation between features
        • impute using k-NN, Multivariate Imputation by Chained Equation (MICE), or Deep Learning – more accurate, factors in correlation between features
      • Handle unbalanced data
        • Source more data
        • Oversample minority or Undersample majority
        • Data augmentation using techniques like SMOTE
    • Modeling
      • Know about Algorithms – Supervised, Unsupervised, and Reinforcement – and which algorithm is best suited based on the available data, either labelled or unlabelled.
        • Supervised learning trains on labelled data, e.g. Linear Regression, Logistic Regression, Decision Trees, Random Forests
        • Unsupervised learning trains on unlabelled data, e.g. PCA, SVD, K-means
        • Reinforcement learning is trained based on actions and rewards, e.g. Q-Learning
      • Hyperparameters
        • are parameters exposed by machine learning algorithms that control how the underlying algorithm operates and their values affect the quality of the trained models
        • some of the common hyperparameters are learning rate, batch size, and epochs (hint: if the learning rate is too large, the minimum might be missed and the loss would oscillate; if the learning rate is too small, it requires too many steps, which takes the process longer and is less efficient)
    • Evaluation
      • Know difference in evaluating model accuracy
        • Use Area Under the (Receiver Operating Characteristic) Curve (AUC) for Binary classification
        • Use root mean square error (RMSE) metric for regression
      • Understand Confusion matrix
        • A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.
        • false positive is an outcome where the model incorrectly predicts the positive class. And a false negative is an outcome where the model incorrectly predicts the negative class.
        • Recall or Sensitivity or TPR (True Positive Rate): number of items correctly identified as positive out of total actual positives – TP/(TP+FN) (hint: use this for cases like fraud detection, where the cost of marking non-frauds as frauds is lower than marking frauds as non-frauds)
        • Specificity or TNR (True Negative Rate): number of items correctly identified as negative out of total negatives – TN/(TN+FP) (hint: use this for cases like videos for kids, where the cost of dropping a few valid videos is lower than showing a few bad ones)
      • Handle Overfitting problems
        • Simplify the model by reducing the number of layers
        • Early Stopping – form of regularization while training a model with an iterative method, such as gradient descent
        • Data Augmentation
        • Regularization – technique to reduce the complexity of the model
        • Dropout is a regularization technique that prevents overfitting
        • Never train on test data
  • AWS Machine Learning
    • SageMaker
      • Know SageMaker in depth
      • supports both File mode and Pipe mode
        • File mode loads all of the data from S3 to the training instance volumes, whereas Pipe mode streams data directly from S3
        • File mode needs disk space to store both the final model artifacts and the full training dataset, whereas Pipe mode helps reduce the required size of the EBS volumes
      • Using RecordIO format allows algorithms to take advantage of Pipe mode when training the algorithms that support it. 
      • supports Model tracking capability to manage up to thousands of machine learning model experiments
      • supports Canary deployment using ProductionVariant and deploying multiple variants of a model to the same SageMaker HTTPS endpoint.
      • supports automatic scaling for production variants. Automatic scaling dynamically adjusts the number of instances provisioned for a production variant in response to changes in your workload
      • provides pre-built Docker images for its built-in algorithms and the supported deep learning frameworks used for training & inference
      • SageMaker Automatic Model Tuning
        • is the process of finding a set of hyperparameters for an algorithm that can yield an optimal model.
        • Best practices
          • limit the search to a smaller number as difficulty of a hyperparameter tuning job depends primarily on the number of hyperparameters that Amazon SageMaker has to search
          • DO NOT specify a very large range to cover every possible value for a hyperparameter as it affects the success of hyperparameter optimization.
          • hyperparameters whose ranges span several orders of magnitude can be converted to a logarithmic scale to improve hyperparameter optimization.
          • running one training job at a time achieves the best results with the least amount of compute time.
          • Design distributed training jobs so that they report the objective metric that you want.
        • SageMaker Neo enables machine learning models to train once and run anywhere in the cloud and at the edge.
      • know how to take advantage of multiple GPUs (hint: increase learning rate and batch size w.r.t to the increase in GPUs)
      • Algorithms –
        • BlazingText provides Word2vec and text classification algorithms
        • DeepAR provides supervised learning algorithm for forecasting scalar (one-dimensional) time series (hint: train for new products based on existing products sales data)
        • Factorization machines provides supervised classification and regression tasks, helps capture interactions between features within high dimensional sparse datasets economically
        • Image classification algorithm is a supervised learning algorithm that supports multi-label classification
        • IP Insights is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses
        • K-means is an unsupervised learning algorithm for clustering as it attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups.
        • k-nearest neighbors (k-NN) algorithm is an index-based algorithm. It uses a non-parametric method for classification or regression
        • Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. Used to identify number of topics shared by documents within a text corpus
        • Linear models are supervised learning algorithms used for solving either classification or regression problems. 
          • For regression (predictor_type=’regressor’), the score is the prediction produced by the model.
          • For classification (predictor_type=’binary_classifier’ or predictor_type=’multiclass_classifier’)
        • Neural Topic Model (NTM) Algorithm is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution
        • Object Detection algorithm detects and classifies objects in images using a single deep neural network
        • Principal Component Analysis (PCA) is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) (hint: dimensionality reduction)
        • Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data point (hint: anomaly detection)
        • Sequence to Sequence is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens. (hint: text summarization is the key use case)
    • SageMaker Ground Truth 
      • provides automated data labeling using machine learning
      • helps build highly accurate training datasets for machine learning quickly using Amazon Mechanical Turk
      • provides annotation consolidation to help improve the accuracy of the data object’s labels. It combines the results of multiple worker’s annotation tasks into one high-fidelity label.
      • automated data labeling uses machine learning to label portions of the data automatically without having to send them to human workers
    • Comprehend
      • natural language processing (NLP) service to find insights and relationships in text.
      • identifies the language of the text; extracts key phrases, places, people, brands, or events; understands how positive or negative the text is; analyzes text using tokenization and parts of speech; and automatically organizes a collection of text files by topic.
    • Lex
      • provides conversational interfaces using voice and text helpful in building voice and text chatbots
    • Polly
      • text into speech
      • supports Speech Synthesis Markup Language (SSML) tags like prosody so users can adjust the speech rate, pitch or volume.
      • supports pronunciation lexicons to customize the pronunciation of words
    • Rekognition
      • analyze image and video
      • helps identify objects, people, text, scenes, and activities in images and videos, as well as detect any inappropriate content.
    • Translate – provides natural and fluent language translation
    • Transcribe – provides speech-to-text capability
    • Elastic Inference helps attach low-cost GPU-powered acceleration to EC2 and SageMaker instances or ECS tasks to reduce the cost of running deep learning inference by up to 75%.
  • Analytics
    • Make sure you know and understand data engineering concepts mainly in terms of data capture, data migration, data transformation and data storage
    • Kinesis
      • Understand Kinesis Data Streams and Kinesis Data Firehose in depth
      • Kinesis Data Analytics can process and analyze streaming data using standard SQL and integrates with Data Streams and Firehose
      • Know Kinesis Data Streams vs Kinesis Firehose
        • Know Kinesis Data Streams is open ended on both producer and consumer. It supports KCL and works with Spark.
        • Know Kinesis Firehose is open ended for producer only. Data is stored in S3, Redshift and ElasticSearch.
        • Kinesis Firehose works in batches with minimum 60secs interval.
        • Kinesis Data Firehose supports data transformation and record format conversion using Lambda function (hint: can be used for transforming csv or JSON into parquet)
    • Know ElasticSearch is a search service which supports indexing, full text search, faceting etc.
    • Know Data Pipeline for data transfer
    • Know Glue as fully managed ETL service
      • helps setup, orchestrate, and monitor complex data flows.
      • AWS Glue Data Catalog
        • is a central repository to store structural and operational metadata for all the data assets.
      • AWS Glue crawler
        • connects to a data store, progresses through a prioritized list of classifiers to extract the schema of the data and other statistics, and then populates the Glue Data Catalog with this metadata
  • Security, Identity & Compliance
    • Security is covered very lightly. (hint: SageMaker can read data from KMS-encrypted S3. Make sure the KMS key policies include the role attached to SageMaker)
  • Management & Governance Tools
    • Understand AWS CloudWatch for Logs and Metrics. (hint: SageMaker is integrated with Cloudwatch and logs and metrics are all stored in it)
  • Storage
    • Understand Data Storage Options – know patterns for S3 vs RDS vs DynamoDB vs Redshift. (hint: S3 is, by default, the data storage option for Big Data storage; look for it in the answer.)

Whitepapers and articles

AWS Certified Machine Learning – Specialty (MLS-C01) Exam Resources

AWS Certification – Machine Learning Concepts – Cheat Sheet

Machine Learning Concepts

This post covers some of the basic Machine Learning concepts mostly relevant for the AWS Machine Learning certification exam.

Machine Learning Lifecycle

Data Processing and Exploratory Analysis

  • To train a model, you need data.
  • The type of data needed depends on the business problem that you want the model to solve (the inferences that you want the model to generate).
  • Processing data includes data collection, data cleaning, data splitting, data exploration, preprocessing, transformation, formatting, etc.

Feature Selection and Engineering

  • helps improve model accuracy and speed up training
  • remove irrelevant data inputs using domain knowledge, e.g. name
  • remove features which have the same values, very low correlation, very little variance, or a lot of missing values
  • handle missing data using mean values or imputation
  • combine features which are related, e.g. height and age to height/age
  • convert or transform features to a useful representation, e.g. date to day or hour
  • standardize data ranges across features
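
To make this concrete, here is a minimal pandas sketch of a few of the steps above; the DataFrame and its column names are made up purely for illustration:

```python
# Illustrative feature engineering with pandas; the columns are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "b", "c"],                      # irrelevant input
    "height": [1.7, 1.8, 1.6],
    "weight": [70, 90, 55],
    "signup_date": ["2023-01-05 09:00", "2023-02-11 17:30", "2023-03-20 08:15"],
})

df = df.drop(columns=["name"])                                  # remove irrelevant inputs
df["height_to_weight"] = df["height"] / df["weight"]            # combine related features
df["signup_hour"] = pd.to_datetime(df["signup_date"]).dt.hour   # transform date to hour
df = df.drop(columns=["height", "weight", "signup_date"])
print(df)
```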

Missing Data

  • do nothing
  • remove the feature with a lot of missing data points
  • remove samples with missing data, if the feature needs to be used
  • Impute using mean/median value
    • quick and easy, with little impact provided the dataset is not skewed
    • works with numerical values only. Do not use for categorical features.
    • doesn’t factor correlations between features
  • Impute using (Most Frequent) or (Zero/Constant) Values
    • works with categorical features
    • doesn’t factor correlations between features
    • can introduce bias
  • Impute using k-NN, Multivariate Imputation by Chained Equation (MICE), Deep Learning
    • more accurate than the mean, median or most frequent
    • Computationally expensive
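
The imputation options above can be sketched with scikit-learn; the toy matrix below is just for illustration:

```python
# Mean imputation vs k-NN imputation on a toy matrix with missing values.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)  # numeric only, ignores correlations
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)        # more accurate, factors in correlations
print(mean_imputed)
print(knn_imputed)
```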

Unbalanced Data

  • Source more real data
  • Oversampling instances of the minority class or undersampling instances of the majority class
  • Create or synthesize data using techniques like SMOTE (Synthetic Minority Oversampling TEchnique)
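
A minimal sketch of random oversampling of the minority class with scikit-learn utilities (SMOTE itself lives in the separate imbalanced-learn package):

```python
# Randomly oversample the minority class until it matches the majority class.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})  # 8 majority, 2 minority

majority = df[df.label == 0]
minority = df[df.label == 1]
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_upsampled])
print(balanced.label.value_counts())
```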

Label Encoding and One-hot Encoding

  • Models cannot multiply strings by the learned weights, encoding helps convert strings to numeric values.
  • Label encoding
    • Use Label encoding to provide a lookup or map string data values to numerical values
    • However, the assigned numeric values are arbitrary, which can impact the model by implying an ordering that does not exist
  • One-hot encoding
    • Use One-hot encoding for Categorical features that have a discrete set of possible values.
    • One-hot encoding provides a binary representation by converting data values into features without impacting the relationships
    • a binary vector is created for each categorical feature in the model that represents values as follows:
      • For values that apply to the example, set corresponding vector elements to 1.
      • Set all other elements to 0.
    • Multi-hot encoding is when multiple values are 1
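
The difference between the two encodings is easy to see in code; the colour values below are made up:

```python
# Label encoding assigns arbitrary integers; one-hot encoding creates one binary column per category.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["red", "green", "blue", "green"])

label_encoded = LabelEncoder().fit_transform(colors)   # e.g. [2, 1, 0, 1]
one_hot = pd.get_dummies(colors, prefix="color")       # color_blue, color_green, color_red columns
print(label_encoded)
print(one_hot)
```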

Cleaning Data

  • Scaling or Normalization means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range (for example, 0 to 1 or -1 to +1)
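
A quick sketch of min-max scaling with scikit-learn (the values are illustrative):

```python
# Scale a feature from its natural range into [0, 1].
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[100.0], [400.0], [900.0]])
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(x)
# equivalent manual formula: (x - x.min()) / (x.max() - x.min())
print(scaled)   # [[0.], [0.375], [1.]]
```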

Train a model

  • Model training includes both training and evaluating the model.
  • To train a model, an algorithm is needed.
  • Data can be split into training data, validation data and test data
    • Algorithm sees and is directly influenced by the training data
    • Algorithm uses but is indirectly influenced by the validation data
    • Algorithm does not see the testing data during training
  • Training can be performed using normal parameters or features and hyperparameters

Supervised, Unsupervised and Reinforcement Learning

Splitting and Randomization

  • Always randomize the data before splitting
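
A minimal sketch of a shuffled train/validation/test split with scikit-learn (toy data):

```python
# Carve out a test set first, then split the remainder into train and validation sets.
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)

X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)
print(len(X_train), len(X_val), len(X_test))   # 30 10 10
```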

Hyperparameters

  • influence how the training occurs
  • Common hyperparameters are learning rate, epoch, batch size
  • Learning rate – size of the step taken during gradient descent optimization
  • Batch size
    • number of samples used to train at any one time
    • can be all (batch), one (stochastic) or some (mini batch)
    • can be calculated based on the available infrastructure (e.g. memory)
  • Epochs
    • number of times the algorithm processes the entire training data
    • each epoch or run can see the model get closer to the desired state
  • the available hyperparameters depend on the algorithm used

Evaluating the model

After training the model, evaluate it to determine whether the accuracy of the inferences is acceptable.

ML Model Insights

  • For binary classification models use accuracy metric called Area Under the (Receiver Operating Characteristic) Curve (AUC). AUC measures the ability of the model to predict a higher score for positive examples as compared to negative examples.
  • For regression tasks, use the industry standard root mean square error (RMSE) metric. It is a distance measure between the predicted numeric target and the actual numeric answer (ground truth). The smaller the value of the RMSE, the better is the predictive accuracy of the model.
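
Both metrics are a one-liner with scikit-learn; the predictions below are toy values:

```python
# AUC for binary classification and RMSE for regression.
import numpy as np
from sklearn.metrics import roc_auc_score, mean_squared_error

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]               # predicted scores/probabilities
print("AUC:", roc_auc_score(y_true, y_score))

y_actual = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print("RMSE:", np.sqrt(mean_squared_error(y_actual, y_pred)))
```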

Cross-Validation

  • is a technique for evaluating ML models by training several ML models on subsets of the available input data and evaluating them on the complementary subset of the data.
  • Use cross-validation to detect overfitting, i.e. failing to generalize a pattern.
  • there is no separate validation dataset; the training data is split into chunks, and each chunk is used in turn for validation
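
A minimal cross-validation sketch with scikit-learn, using a generated dataset and a simple model for illustration:

```python
# 5-fold cross-validation: the training data is split into 5 chunks, each used once for validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```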

Optimization

  • Gradient Descent is used to optimize many different types of machine learning algorithms
  • Step size sets Learning rate
    • If the learning rate is too large, the minimum slope might be missed and the graph would oscillate
    • If the learning rate is too small, it requires too many steps which would take the process longer and is less efficient
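
The effect of the learning rate is easy to see on a toy objective; the function below is purely illustrative:

```python
# Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3.
def gradient_descent(learning_rate, steps=25, w=0.0):
    for _ in range(steps):
        grad = 2 * (w - 3)          # derivative of (w - 3)^2
        w -= learning_rate * grad   # step in the direction of the negative gradient
    return w

print(gradient_descent(0.1))    # converges close to the minimum at w = 3
print(gradient_descent(1.1))    # too large: overshoots and oscillates away from the minimum
print(gradient_descent(0.001))  # too small: barely moves in 25 steps
```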

Overfitting

  • Simplify the model by reducing the number of layers
  • Early Stopping – form of regularization while training a model with an iterative method, such as gradient descent
  • Data Augmentation
  • Regularization – technique to reduce the complexity of the model
  • Dropout is a regularization technique that prevents overfitting
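
Two of these mitigations (regularization and early stopping) can be sketched with scikit-learn; the dataset and parameter values are illustrative:

```python
# L2 regularization plus early stopping on a held-out validation fraction.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = SGDClassifier(
    penalty="l2", alpha=0.01,     # regularization to reduce model complexity
    early_stopping=True,          # stop when the validation score stops improving
    validation_fraction=0.2,
    random_state=0,
)
clf.fit(X, y)
print(clf.n_iter_)                # iterations actually run before stopping
```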

Classification Model Evaluation

Confusion Matrix

  • Confusion matrix represents the percentage of times each label was predicted in the training set during evaluation
  • An NxN table that summarizes how successful a classification model’s predictions were; that is, the correlation between the label and the model’s classification.
  • One axis of a confusion matrix is the label that the model predicted, and the other axis is the actual label.
  • N represents the number of classes. In a binary classification problem, N=2
    • For example, here is a sample confusion matrix for a binary classification problem:
                        Tumor (predicted)      Non-Tumor (predicted)
Tumor (actual)          18 (True Positives)    1 (False Negatives)
Non-Tumor (actual)      6 (False Positives)    452 (True Negatives)
    • Confusion matrix shows that of the 19 samples that actually had tumors, the model correctly classified 18 as having tumors (18 true positives), and incorrectly classified 1 as not having a tumor (1 false negative).
    • Similarly, of 458 samples that actually did not have tumors, 452 were correctly classified (452 true negatives) and 6 were incorrectly classified (6 false positives).
  • Confusion matrix for a multi-class classification problem can help you determine mistake patterns. For example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly predict 9 instead of 4, or 1 instead of 7.

Accuracy, Precision, Recall (Sensitivity) and Specificity

Accuracy

  • A metric for classification models that identifies the fraction of predictions a classification model got right.
  • In Binary classification, calculated as (True Positives+True Negatives)/Total Number Of Examples
  • In Multi-class classification, calculated as Correct Predictions/Total Number Of Examples

Precision

  • A metric for classification models that identifies the frequency with which a model was correct when predicting the positive class.
  • Calculated as True Positives/(True Positives + False Positives)

Recall – Sensitivity – True Positive Rate (TPR)

  • A metric for classification models that answers the following question: Out of all the possible positive labels, how many did the model correctly identify i.e. Number of correct positives out of actual positive results
  • Calculated as True Positives/(True Positives + False Negatives)
  • Important when – False Positives are acceptable as long as ALL positives are found for e.g. it is fine to predict Non-Tumor as Tumor as long as All the Tumors are correctly predicted

Specificity – True Negative Rate (TNR)

  • Number of correct negatives out of actual negative results
  • Calculated as True Negatives/(True Negatives + False Positives)
  • Important when – False Positives are unacceptable; it’s better to have false negatives for e.g. it is not fine to predict Non-Tumor as Tumor; 
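
All four metrics follow directly from the confusion matrix counts; using the tumor example above:

```python
# Metrics computed from the tumor confusion matrix shown earlier.
tp, fn, fp, tn = 18, 1, 6, 452

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)        # sensitivity / TPR
specificity = tn / (tn + fp)   # TNR
print(accuracy, precision, recall, specificity)
```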

ROC and AUC

ROC (Receiver Operating Characteristic) Curve

  • An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds.
  • It plots the True Positive Rate (TPR) vs. the False Positive Rate (FPR) at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.

    ROC Curve showing TP Rate vs. FP Rate at different classification thresholds.

AUC (Area under the ROC curve)

  • AUC stands for “Area under the ROC Curve.”
  • AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1).
  • AUC provides an aggregate measure of performance across all possible classification thresholds.
  • One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.
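
A short scikit-learn sketch ties the two together; each point returned by roc_curve corresponds to one classification threshold:

```python
# ROC curve points (FPR, TPR per threshold) and the area under that curve.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(thresholds, fpr, tpr)))
print("AUC:", roc_auc_score(y_true, y_score))
```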

AUC (Area under the ROC Curve).

F1 Score

  • F1 score (also F-score or F-measure) is a measure of a test’s accuracy.
  • It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive). The score is the harmonic mean of the two: F1 = 2 · p · r / (p + r).

Deploy the model

  • Re-engineer a model before integrating it with the application and deploying it.
  • Can be deployed as a Batch or as a Service

AWS SageMaker

  • SageMaker is a fully managed machine learning service to build, train, and deploy machine learning (ML) models quickly.
  • SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high quality models.
  • SageMaker is designed for high availability with no maintenance windows or scheduled downtimes
  • SageMaker APIs run in Amazon’s proven, high-availability data centers, with service stack replication configured across three facilities in each AWS region to provide fault tolerance in the event of a server failure or AZ outage
  • SageMaker provides a full end-to-end workflow, but users can continue to use their existing tools with SageMaker.
  • SageMaker supports Jupyter notebooks.
  • SageMaker allows users to select the number and type of instance used for the hosted notebook, training & model hosting.

SageMaker Machine Learning

Generate example data

  • Involves exploring and preprocessing, or “wrangling,” example data before using it for model training.
  • To preprocess data, you typically do the following:
    • Fetch the data
    • Clean the data
    • Prepare or transform the data

Train a model

  • Model training includes both training and evaluating the model, as follows:
  • Training the model
    • Needs an algorithm, which depends on a number of factors.
    • Need compute resources for training.
  • Evaluating the model
    • determine whether the accuracy of the inferences is acceptable.

Training Data Format – File mode vs Pipe mode

    • Most Amazon SageMaker algorithms work best when using the optimized protobuf recordIO format for the training data.
    • Using RecordIO format allows algorithms to take advantage of Pipe mode when training the algorithms that support it.
    • File mode loads all of the data from S3 to the training instance volumes
    • In Pipe mode, the training job streams data directly from S3.
    • Streaming can provide faster start times for training jobs and better throughput.
    • With Pipe mode, the size of the EBS volumes for the training instances can also be reduced, as Pipe mode needs only enough disk space to store the final model artifacts.
    • File mode needs disk space to store both the final model artifacts and the full training dataset.
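
As a rough sketch with the SageMaker Python SDK, Pipe mode is requested on the estimator; the image URI, role ARN, and S3 paths below are placeholders, not real resources:

```python
# Requesting Pipe mode so training streams data from S3 instead of downloading it first.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="<algorithm-image-uri>",
    role="<sagemaker-execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",                       # File mode is the default
    sagemaker_session=sagemaker.Session(),
)
estimator.fit({"train": TrainingInput("s3://<bucket>/train/", content_type="application/x-recordio")})
```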

Build Model

  • SageMaker provides several built-in machine learning algorithms that can be used for a variety of problem types
  • Write a custom training script in a machine learning framework that SageMaker supports, and use one of the pre-built framework containers to run it in SageMaker.
  • Bring your own algorithm or model to train or host in SageMaker.
    • SageMaker provides pre-built Docker images for its built-in algorithms and the supported deep learning frameworks used for training and inference
    • By using containers, machine learning models can be trained and deployed quickly and reliably at any scale.
  • Use an algorithm that you subscribe to from AWS Marketplace.

Deploy the model

  • Re-engineer a model before integrating it with the application and deploying it.
  • supports both hosting services and batch transform

Hosting services

    • provides an HTTPS endpoint where the machine learning model is available to provide inferences.
    • supports Canary deployment using ProductionVariant and deploying multiple variants of a model to the same SageMaker HTTPS endpoint.
    • supports automatic scaling for production variants. Automatic scaling dynamically adjusts the number of instances provisioned for a production variant in response to changes in your workload
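
A rough boto3 sketch of a canary-style rollout with two production variants behind one endpoint; the names and weights are placeholders:

```python
# Two variants of a model on the same endpoint, with 90/10 traffic weighting.
import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint_config(
    EndpointConfigName="my-endpoint-config",
    ProductionVariants=[
        {"VariantName": "current", "ModelName": "model-v1",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 2, "InitialVariantWeight": 0.9},
        {"VariantName": "canary", "ModelName": "model-v2",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1, "InitialVariantWeight": 0.1},
    ],
)
```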

Batch transform

    • To get inferences on entire datasets, consider using batch transform as an alternative to hosting services.

SageMaker Security

  • SageMaker ensures that ML model artifacts and other system artifacts are encrypted in transit and at rest.
  • SageMaker allows using encrypted S3 buckets for model artifacts and data, as well as pass a KMS key to SageMaker notebooks, training jobs, and endpoints, to encrypt the attached ML storage volume.
  • Requests to the SageMaker API and console are made over a secure (SSL) connection.
  • SageMaker stores code in ML storage volumes, secured by security groups and optionally encrypted at rest.

SageMaker Notebooks

  • SageMaker notebooks are collaborative notebooks that are built into SageMaker Studio that can be launched quickly.
  • can be accessed without setting up compute instances and file storage
  • charged only for the resources consumed when the notebook is running
  • instance types can be easily switched if more or less computing power is needed during the experimentation phase.

SageMaker Built-in Algorithms

Please refer SageMaker Built-in Algorithms for details

Elastic Inference (EI)

  • helps speed up the throughput and decrease the latency of getting real-time inferences from the deep learning models deployed as SageMaker hosted models
  • adds inference acceleration to a hosted endpoint for a fraction of the cost of using a full GPU instance.

SageMaker Ground Truth

  • provides automated data labeling using machine learning
  • helps build highly accurate training datasets for machine learning quickly.
  • offers easy access to labelers through Amazon Mechanical Turk and provides them with built-in workflows and interfaces for common labeling tasks.
  • allows using your own labelers or using vendors recommended by Amazon through AWS Marketplace.
  • helps lower the labeling costs by up to 70% using automatic labeling, which works by training Ground Truth from data labeled by humans so that the service learns to label data independently.
  • significantly reduces the time and effort required to create datasets for training to reduce costs
  • provides annotation consolidation to help improve the accuracy of the data object’s labels. It combines the results of multiple worker’s annotation tasks into one high-fidelity label.

  • first selects a random sample of data and sends it to Amazon Mechanical Turk to be labeled.
  • results are then used to train a labeling model that attempts to label a new sample of raw data automatically.
  • labels are committed when the model can label the data with a confidence score that meets or exceeds a threshold you set.
  • for confidence score falling below the defined threshold, the data is sent to human labelers.
  • Some of the data labeled by humans is used to generate a new training dataset for the labeling model, and the model is automatically retrained to improve its accuracy.
  • process repeats with each sample of raw data to be labeled.
  • labeling model becomes more capable of automatically labeling raw data with each iteration, and less data is routed to humans.

SageMaker Automatic Model Training

  • Hyperparameters are parameters exposed by machine learning algorithms that control how the underlying algorithm operates and their values affect the quality of the trained models
  • Automatic model tuning is the process of finding a set of hyperparameters for an algorithm that can yield an optimal model.
  • Best Practices for Hyperparameter tuning
    • Choosing the Number of Hyperparameters – limit the search to a smaller number, as the difficulty of a hyperparameter tuning job depends primarily on the number of hyperparameters that Amazon SageMaker has to search
    • Choosing Hyperparameter Ranges – DO NOT specify a very large range to cover every possible value for a hyperparameter. The range of values you choose to search can significantly affect the success of hyperparameter optimization.
    • Using Logarithmic Scales for Hyperparameters – convert hyperparameters whose ranges span several orders of magnitude to a logarithmic scale to improve hyperparameter optimization.
    • Choosing the Best Number of Concurrent Training Jobs – running one training job at a time achieves the best results with the least amount of compute time.
    • Running Training Jobs on Multiple Instances – design distributed training jobs so that they report the objective metric that you want.
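
A rough sketch of a tuning job with the SageMaker Python SDK; the estimator, metric name, ranges, and input channels are assumed to exist and are chosen only for illustration:

```python
# Automatic model tuning over a small number of hyperparameters, one training job at a time.
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,                     # an existing Estimator, e.g. built-in XGBoost
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3, scaling_type="Logarithmic"),  # log scale for wide ranges
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,
    max_parallel_jobs=1,                     # fewer concurrent jobs tends to give better results
)
tuner.fit({"train": train_input, "validation": validation_input})
```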

SageMaker Neo

  • SageMaker Neo enables machine learning models to train once and run anywhere in the cloud and at the edge.
  • Automatically optimizes models built with popular deep learning frameworks that can be used to deploy on multiple hardware platforms.
  • Optimized models run up to two times faster and consume less than a tenth of the resources of typical machine learning models.

SageMaker Pricing

  • Users pay for the ML compute, storage, and data processing resources they use for hosting the notebook, training the model, performing predictions, and logging the outputs.

AWS Certification – Machine Learning Services – Cheat Sheet

Amazon SageMaker

  • Build, train, and deploy machine learning models at scale
  • fully-managed service that enables data scientists and developers to quickly and easily build, train & deploy machine learning models.
  • enables developers and scientists to build machine learning models for use in intelligent, predictive apps.
  • is designed for high availability with no maintenance windows or scheduled downtimes.
  • allows users to select the number and type of instance used for the hosted notebook, training & model hosting.
  • models can be deployed as endpoints for real-time inference or run as batch transforms.
  • supports Canary deployment using ProductionVariant and deploying multiple variants of a model to the same SageMaker HTTPS endpoint.
  • supports Jupyter notebooks.
  • Users can persist their notebook files on the attached ML storage volume.
  • Users can modify the notebook instance and select a larger profile through the SageMaker console, after saving their files and data on the attached ML storage volume.
  • includes built-in algorithms for linear regression, logistic regression, k-means clustering, principal component analysis, factorization machines, neural topic modeling, latent dirichlet allocation, gradient boosted trees, seq2seq, time series forecasting, word2vec & image classification
  • algorithms work best when using the optimized protobuf recordIO format for the training data, which allows Pipe mode that streams data directly from S3 and helps faster start times and reduce space requirements
  • provides built-in algorithms, pre-built container images, or extend a pre-built container image and even build your custom container image.
  • supports users custom training algorithms provided through a Docker image adhering to the documented specification.
  • also provides optimized MXNet, Tensorflow, Chainer & PyTorch containers
  • ensures that ML model artifacts and other system artifacts are encrypted in transit and at rest.
  • requests to the API and console are made over a secure (SSL) connection.
  • stores code in ML storage volumes, secured by security groups and optionally encrypted at rest.
  • SageMaker Neo is a new capability that enables machine learning models to train once and run anywhere in the cloud and at the edge.

Amazon Comprehend

  • is a managed natural language processing (NLP) service to find insights and relationships in text.
  • identifies the language of the text; extracts key phrases, places, people, brands, or events; understands how positive or negative the text is; analyzes text using tokenization and parts of speech; and automatically organizes a collection of text files by topic.
  • can analyze a collection of documents and other text files (such as social media posts) and automatically organize them by relevant terms or topics.

Amazon Lex

  • is a service for building conversational interfaces using voice and text.
  • provides the advanced deep learning functionalities of automatic speech recognition (ASR) for converting speech to text, and natural language understanding (NLU) to recognize the intent of the text, to enable building applications with highly engaging user experiences and lifelike conversational interactions.
  • common use-cases of Lex include: Application/Transactional bot, Informational bot,  Enterprise Productivity bot and Device Control bot.
  • leverages Lambda for Intent fulfillment, Cognito for user authentication & Polly for text to speech.
  • scales to customers needs and does not impose bandwidth constraints.
  • is a completely managed service so users don’t have to manage scaling of resources or maintenance of code.
  • uses deep learning to improve over time.

Amazon Polly

  • text into speech
  • uses advanced deep learning technologies to synthesize speech that sounds like a human voice.
  • supports Speech Synthesis Markup Language (SSML) tags like prosody so users can adjust the speech rate, pitch or volume.

Amazon Rekognition

  • analyze image and video
  • identify objects, people, text, scenes, and activities in images and videos, as well as detect any inappropriate content.
  • provides highly accurate facial analysis and facial search capabilities that you can use to detect, analyze, and compare faces for a wide variety of user verification, people counting, and public safety use cases.
  • helps identify potentially unsafe or inappropriate content across both image and video assets and provides detailed labels that allow to accurately control what you want to allow based on your needs.

Amazon SageMaker Ground Truth

  • helps build highly accurate training datasets for machine learning quickly.
  • offers easy access to labelers through Amazon Mechanical Turk and provides them with built-in workflows and interfaces for common labeling tasks.
  • allows using your own labelers or use vendors recommended by Amazon through AWS Marketplace.
  • helps lower labeling costs by up to 70% using automatic labeling, which works by training Ground Truth from data labeled by humans so that the service learns to label data independently.
  • provides annotation consolidation to help improve the accuracy of the data object’s labels.

Amazon Translate

  • provides natural and fluent language translation
  • is a neural machine translation service that delivers fast, high-quality, and affordable language translation.
  • Neural machine translation is a form of language translation automation that uses deep learning models to deliver more accurate and more natural sounding translation than traditional statistical and rule-based translation algorithms.
  • allows content localization – such as websites and applications – for international users, and to easily translate large volumes of text efficiently.

Amazon Transcribe

  • provides speech-to-text capability
  • uses a deep learning process called automatic speech recognition (ASR) to convert speech to text quickly and accurately.
  • can be used to transcribe customer service calls, to automate closed captioning and subtitling, and to generate metadata for media assets to create a fully searchable archive
  • adds punctuation and formatting so that the output closely matches the quality of manual transcription at a fraction of the time and expense.
  • process audio in batch or in near real-time
  • supports custom vocabulary to generate more accurate transcriptions for domain-specific words and phrases like product names, technical terminology, or names of individuals.
  • specify a list of words to remove from transcripts

Amazon Elastic Inference

  • helps attach low-cost GPU-powered acceleration to EC2 and SageMaker instances or ECS tasks to reduce the cost of running deep learning inference by up to 75%.
  • supports TensorFlow, Apache MXNet, and ONNX models, with more frameworks coming soon.

AWS SageMaker Built-in Algorithms Summary

SageMaker Built-in Algorithms

BlazingText algorithm

    • provides highly optimized implementations of the Word2vec and text classification algorithms.
    • Word2vec algorithm
      • useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine translation, etc.
      • maps words to high-quality distributed vectors, whose representation is called word embeddings
      • word embeddings capture the semantic relationships between words.
    • Text classification
      • is an important task for applications performing web searches, information retrieval, ranking, and document classification
    • provides the Skip-gram and continuous bag-of-words (CBOW) training architectures

DeepAR forecasting algorithm

    • is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN).
    • use the trained model to generate forecasts for new time series that are similar to the ones it has been trained on.

Factorization machine

    • is a general-purpose supervised learning algorithm used for both classification and regression tasks.
    • extension of a linear model designed to capture interactions between features within high dimensional sparse datasets economically

Image classification algorithm

    • a supervised learning algorithm that supports multi-label classification
    • takes an image as input and outputs one or more labels
    • uses a convolutional neural network (ResNet) that can be trained from scratch or trained using transfer learning when a large number of training images are not available.
    • recommended input format is Apache MXNet RecordIO. Also supports raw images in .jpg or .png format.

IP Insights

    • is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses.
    • designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers

K-means algorithm

    • is an unsupervised learning algorithm for clustering
    • attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups

K-nearest neighbors (k-NN) algorithm

    • is an index-based algorithm.
    • uses a non-parametric method for classification or regression.
    • For classification problems, the algorithm queries the k points that are closest to the sample point and returns the most frequently used label of their class as the predicted label.
    • For regression problems, the algorithm queries the k closest points to the sample point and returns the average of their feature values as the predicted value.

Latent Dirichlet Allocation (LDA) algorithm

    • is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories.
    • used to discover a user-specified number of topics shared by documents within a text corpus.

Linear Learner

    • is a supervised learning algorithm used for solving either classification or regression problems

Neural Topic Model (NTM) Algorithm

    • is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution
    • Topic modeling can be used to classify or summarize documents based on the topics detected or to retrieve information or recommend content based on topic similarities.

Object2Vec algorithm

    • is a general-purpose neural embedding algorithm that is highly customizable
    • can learn low-dimensional dense embeddings of high-dimensional objects.

Object Detection algorithm

    • detects and classifies objects in images using a single deep neural network.
    • is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene.

Principal Component Analysis

    • is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible

Random Cut Forest (RCF)

    • is an unsupervised algorithm for detecting anomalous data points within a data set.

Semantic segmentation algorithm

    • provides a fine-grained, pixel-level approach to developing computer vision applications

SageMaker Sequence to Sequence (seq2seq)

    • is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens.
    • key uses cases are machine translation (input a sentence from one language and predict what that sentence would be in another language), text summarization (input a longer string of words and predict a shorter string of words that is a summary), speech-to-text (audio clips converted into output sentences in tokens)

XGBoost (eXtreme Gradient Boosting)

    • is a popular and efficient open-source implementation of the gradient boosted trees algorithm.
    • Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler, weaker models

AWS Certified Big Data -Speciality (BDS-C00) Exam Learning Path

Clearing the AWS Certified Big Data – Speciality (BDS-C00) was a great feeling. This was my third Speciality certification, and in terms of difficulty level (compared to the Network and Security Speciality exams), I would rate it between Network (being the toughest) and Security (being the simpler one).

Big Data in itself is a very vast topic, and with AWS services there is a lot to cover and know for the exam. If you have worked on Big Data technologies, including a bit of visualization and machine learning, it would be a great asset for passing this exam.

AWS Certified Big Data – Speciality (BDS-C00) exam basically validates

  • Implement core AWS Big Data services according to basic architectural best practices
  • Design and maintain Big Data
  • Leverage tools to automate Data Analysis

Refer AWS Certified Big Data – Speciality Exam Guide for details

                              AWS Certified Big Data – Speciality Domains

AWS Certified Big Data – Speciality (BDS-C00) Exam Summary

  • AWS Certified Big Data – Speciality exam, as its name suggests, covers a lot of Big Data concepts right from data transfer and collection techniques, storage, pre and post processing, analytics, visualization with the added concepts for data security at each layer.
  • One of the key tactics I followed when solving any AWS Certification exam is to read the question, use paper and pencil to draw a rough architecture, and focus on the areas that need attention. Trust me, you will be able to eliminate 2 answers for sure and then need to focus on only the other two. Read the other 2 answers to check where they differ; that would help you reach the right answer or at least have a 50% chance of getting it right.
  • Be sure to cover the following topics
    • Whitepapers and articles
    • Analytics
      • Make sure you know and cover all the services in depth, as 80% of the exam is focused on these topics
      • Elastic Map Reduce
        • Understand EMR in depth
        • Understand EMRFS (hint: Use Consistent view to make sure S3 objects referred by different applications are in sync)
        • Know EMR Best Practices (hint: start with many small nodes instead of a few large nodes)
        • Know Hive can be externally hosted using RDS, Aurora and AWS Glue Data Catalog
        • Know also different technologies
          • Presto is a fast SQL query engine designed for interactive analytic queries over large datasets from multiple sources
          • D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG, and CSS
          • Spark is a distributed processing framework and programming model that helps do machine learning, stream processing, or graph analytics using Amazon EMR clusters
          • Zeppelin/Jupyter as a notebook for interactive data exploration and provides open-source web application that can be used to create and share documents that contain live code, equations, visualizations, and narrative text
          • Phoenix is used for OLTP and operational analytics, allowing you to use standard SQL queries and JDBC APIs to work with an Apache HBase backing store
      • Kinesis
        • Understand Kinesis Data Streams and Kinesis Data Firehose in depth
        • Know Kinesis Data Streams vs Kinesis Firehose
          • Know Kinesis Data Streams is open ended on both producer and consumer. It supports KCL and works with Spark.
          • Know Kinesis Firehose is open ended for the producer only. Data is stored in S3, Redshift and ElasticSearch.
          • Kinesis Firehose works in batches with a minimum 60-second interval.
        • Understand Kinesis Encryption (hint: use server-side encryption or encrypt in the producer for data streams)
        • Know the difference between the KPL and the SDK (hint: PutRecords is synchronous, while the KPL supports batching)
        • Kinesis Best Practices (hint: increase performance by increasing the number of shards)
      • Know ElasticSearch is a search service which supports indexing, full text search, faceting etc.
      • Redshift
        • Understand Redshift in depth
        • Understand Redshift Advance topics like Workload Management, Distribution Style, Sort key
        • Know Redshift Best Practices w.r.t selection of Distribution style, Sort key, COPY command which allows parallelism
        • Know Redshift views to control access to data.
      • Amazon Machine Learning
      • Know Data Pipeline for data transfer
      • QuickSight
      • Know Glue as the ETL tool
    • Security, Identity & Compliance
    • Management & Governance Tools
      • Understand AWS CloudWatch for Logs and Metrics. Also, CloudWatch Events provides more real-time alerts as compared to CloudTrail
    • Storage
    • Compute
      • Know EC2 access to services using IAM Role and Lambda using Execution role.
      • Lambda, esp. how to improve performance using batching, breaking functions apart, etc.

AWS Certified Big Data – Speciality (BDS-C00) Exam Resources

AWS Data Transfer Services

  • AWS provides a suite of data transfer services that includes many methods to migrate your data more effectively.
  • Data Transfer services work both Online and Offline and the usage depends on several factors like amount of data, time required, frequency, available bandwidth and cost.
  • Online data transfer and hybrid cloud storage
    • Use a network link to the VPC to transfer data to AWS, or use S3 for hybrid cloud storage with existing on-premises applications.
    • helps both to lift and shift large datasets once, as well as integrate existing process flows like backup and recovery or continuous data streams directly with cloud storage.
  • Offline data migration to Amazon S3.
    • uses shippable, ruggedized devices that are ideal for moving large archives, data lakes, or in situations where bandwidth and data volumes cannot pass over your networks within the desired time frame.

Online data transfer

VPN

  • connect securely between data centers and AWS
  • quick to set up and cost-efficient
  • ideal for small data transfers and connectivity
  • not as reliable as it still uses a shared Internet connection

Direct Connect

  • provides dedicated physical connection to accelerate network transfers between data centers and AWS
  • provides reliable data transfer
  • ideal for regular large data transfer
  • needs time to set up
  • is not a cost-efficient solution
  • can be secured using VPN over Direct Connect

AWS S3 Transfer Acceleration

  • makes public Internet transfers to S3 faster.
  • helps maximize the available bandwidth regardless of distance or varying Internet weather, and there are no special clients or proprietary network protocols. Simply change the endpoint you use with the S3 bucket and acceleration is automatically applied (see the sketch below).
  • ideal for recurring jobs that travel across the globe, such as media uploads, backups, and local data processing tasks that are regularly sent to a central location
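
A minimal sketch of that endpoint switch, assuming Transfer Acceleration has already been enabled on a hypothetical bucket named my-demo-bucket (file and key names are also made up):

    import boto3
    from botocore.config import Config

    # Point the client at the accelerate endpoint; the upload call itself is unchanged.
    s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
    s3.upload_file("backup.tar.gz", "my-demo-bucket", "backups/backup.tar.gz")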

AWS DataSync

  • automates moving data between on-premises storage and S3 or Elastic File System (Amazon EFS).
  • automatically handles many of the tasks related to data transfers that can slow down migrations or burden IT operations, including running your own instances, handling encryption, managing scripts, network optimization, and data integrity validation.
  • helps transfer data at speeds up to 10 times faster than open-source tools.
  • uses AWS Direct Connect or internet links to AWS and ideal for one-time data migrations, recurring data processing workflows, and automated replication for data protection and recovery.

Offline data transfer

AWS Snowball

  • is a petabyte-scale data transport solution that uses secure appliances to transfer large amounts of data into and out of AWS.
  • ideal for one time large data transfers with limited network bandwidth, long transfer times, and security concerns
  • is simple, fast, and secure.
  • can be very cost and time efficient for large data transfer

AWS Snowball Edge

  • is a petabyte-scale data transfer device with on-board storage and compute capabilities
  • moves large amounts of data into and out of AWS, acts as a temporary storage tier for large local datasets, and supports local workloads in remote or offline locations.
  • ideal for one time large data transfers with limited network bandwidth, long transfer times, and security concerns
  • is simple, fast, and secure.
  • can be very cost and time efficient for large data transfer

AWS Snowmobile

  • is an exabyte-scale data transport solution that uses a secure 45-foot ruggedized shipping container, pulled by a semi-trailer truck, to transfer large amounts of data into and out of AWS.
  • addresses common challenges with large-scale data transfers including high network costs, long transfer times, and security concerns.
  • transfer is done through a custom engagement, is fast, secure, and can be as little as one-fifth the cost of transferring data via high-speed Internet.

Data Transfer Chart – Bandwidth vs Time

Data Migration Speeds

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion and correction.
  1. An organization is moving non-business-critical applications to AWS while maintaining a mission critical application in an on-premises data center. An on-premises application must share limited confidential information with the applications in AWS. The Internet performance is unpredictable. Which configuration will ensure continued connectivity between sites MOST securely?
    1. VPN and a cached storage gateway
    2. AWS Snowball Edge
    3. VPN Gateway over AWS Direct Connect
    4. AWS Direct Connect
  2. A company wants to transfer petabyte-scale data to AWS for analytics, but is constrained by its Internet connectivity. Which AWS service can help them transfer the data quickly?
    1. S3 enhanced uploader
    2. Snowmobile
    3. Snowball
    4. Direct Connect
  3. A company wants to transfer their video library data, which runs into exabytes, to AWS. Which AWS service can help the company transfer the data?
    1. Snowmobile
    2. Snowball
    3. S3 upload
    4. S3 enhanced uploader
  4. You are working with a customer who has 100 TB of archival data that they want to migrate to Amazon Glacier. The customer has a 1-Gbps connection to the Internet. Which service or feature provides the fastest method of getting the data into Amazon Glacier?
    1. Amazon Glacier multipart upload
    2. AWS Storage Gateway
    3. VM Import/Export
    4. AWS Snowball

AWS Redshift Advanced

AWS Redshift Advanced

AWS Redshift advanced topics cover table Distribution Styles, Sort Keys, Workload Management, etc.

Distribution Styles

  • Table distribution style determines how data is distributed across compute nodes and helps minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed.
  • Redshift supports four distribution styles: AUTO, EVEN, KEY, and ALL.

KEY distribution

  • A single column acts as distribution key (DISTKEY) and helps place matching values on the same node slice.
  • As a rule of thumb you should choose a column that:
    • Is uniformly distributed – otherwise skewed data will cause imbalances in the volume of data stored on each compute node, leading to undesired situations where some slices process more data than others and causing bottlenecks.
    • Acts as a JOIN column – for tables related to dimension tables (star schema), it is better to choose as DISTKEY the field that acts as the JOIN field with the largest dimension table, so that matching values from the common columns are physically stored together, reducing the amount of data that needs to be broadcast over the network.

EVEN distribution

  • distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column
  • Choose EVEN distribution
    • when the table does not participate in joins
    • when there is not a clear choice between KEY and ALL distribution.

ALL distribution

  • whole table is replicated in every compute node.
  • ensures that every row is collocated for every join that the table participates in
  • ideal for relatively slow-moving tables, i.e. tables that are not updated frequently or extensively
  • Small dimension tables DO NOT benefit significantly from ALL distribution, because the cost of redistribution is low.

AUTO distribution

  • Redshift assigns an optimal distribution style based on the size of the table data e.g. it applies ALL distribution to a small table and changes it to EVEN distribution as the table grows
  • Amazon Redshift applies AUTO distribution, by default.
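
As a rough illustration of the styles above (cluster, database, table, and column names are all hypothetical), the distribution style is declared in the table DDL, and the statements can be submitted through the Redshift Data API or any SQL client:

    import boto3

    # Hypothetical DDL showing the three explicit distribution styles.
    ddl_statements = [
        # KEY: collocate rows that join on customer_id on the same slice
        """CREATE TABLE sales (
               sale_id BIGINT, customer_id BIGINT, amount DECIMAL(12,2))
           DISTSTYLE KEY DISTKEY (customer_id);""",
        # EVEN: round-robin across slices, for tables not used in joins
        """CREATE TABLE stage_events (event_id BIGINT, payload VARCHAR(256))
           DISTSTYLE EVEN;""",
        # ALL: full copy on every node, for small, slow-moving dimension tables
        """CREATE TABLE dim_country (country_code CHAR(2), name VARCHAR(64))
           DISTSTYLE ALL;""",
    ]

    client = boto3.client("redshift-data")
    for ddl in ddl_statements:
        client.execute_statement(
            ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser", Sql=ddl
        )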

Sort Key

  • Sort keys define the order in which the data will be stored.
  • Sorting enables efficient handling of range-restricted predicates
  • Only one sort key per table can be defined, but it can be composed with one or more columns.
  • Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If a query uses a range-restricted predicate, the query processor can use the min and max values to rapidly skip over large numbers of blocks during table scans
  • There are two kinds of sort keys in Redshift: Compound and Interleaved.

Compound Keys

  • A compound key is made up of all of the columns listed in the sort key definition, in the order they are listed.
  • A compound sort key is more efficient when query predicates use a prefix of the sort key, i.e. the query’s filter and join conditions reference a subset of the sort key columns, in order.
  • Compound sort keys might speed up joins, GROUP BY and ORDER BY operations, and window functions that use PARTITION BY and ORDER BY.

Interleaved Sort Keys

  • An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.
  • An interleaved sort key is more efficient when multiple queries use different columns for filters
  • Don’t use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.
  • Use cases involve performing ad-hoc multi-dimensional analytics, which often requires pivoting, filtering and grouping data using different columns as query dimensions.
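
A short, hypothetical DDL sketch contrasting the two sort key types (table and column names are made up, and the statements can be run via the Redshift Data API as in the earlier sketch); note that the interleaved example deliberately avoids the monotonically increasing timestamp column:

    # Compound: best when filters and joins use a prefix of (event_time, device_id).
    compound_ddl = """
    CREATE TABLE events_compound (
        event_time TIMESTAMP,
        device_id  BIGINT,
        metric     VARCHAR(64)
    )
    COMPOUND SORTKEY (event_time, device_id);
    """

    # Interleaved: equal weight per column, suited to ad-hoc filters on either column.
    interleaved_ddl = """
    CREATE TABLE events_interleaved (
        device_id BIGINT,
        metric    VARCHAR(64),
        region    VARCHAR(16)
    )
    INTERLEAVED SORTKEY (device_id, metric);
    """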

Constraints

  • Redshift supports UNIQUE, PRIMARY KEY and FOREIGN KEY constraints, however they are for informational purposes only.
  • Redshift does not perform integrity checks for these constraints; they are used by the query planner, as hints, to optimize query execution.
  • Redshift does enforce NOT NULL column constraints.

Redshift Workload Management

  • Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won’t get stuck in queues behind long-running queries
  • Redshift provides query queues, in order to manage concurrency and resource planning. Each queue can be configured with the following parameters:
    • Slots: number of concurrent queries that can be executed in this queue.
    • Working memory: percentage of memory assigned to this queue.
    • Max. Execution Time: the amount of time a query is allowed to run before it is terminated.
  • Queries can be routed to different queues using Query Groups and User Groups. As a rule of thumb, it is considered a best practice to have separate queues for long-running, resource-intensive queries and for fast queries that don’t require big amounts of memory and CPU.
  • By default, Amazon Redshift configures one queue with a concurrency level of five, which enables up to five queries to run concurrently, plus one predefined Superuser queue, with a concurrency level of one. A maximum of eight queues can be defined, with each queue configured with a maximum concurrency level of 50. The maximum total concurrency level for all user-defined queues (not including the Superuser queue) is 50.
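
A minimal sketch of how such queues could be defined, assuming a pre-existing, hypothetically named parameter group custom-wlm; the queue definitions are carried by the wlm_json_configuration cluster parameter (queue names and values are illustrative only):

    import json
    import boto3

    # Two user-defined queues plus a trailing default queue, matching the
    # separate-queues rule of thumb above.
    wlm_config = [
        {
            "query_group": ["etl"],           # routed via: SET query_group TO 'etl';
            "query_concurrency": 3,           # slots
            "memory_percent_to_use": 60,      # working memory
            "max_execution_time": 3600000,    # ms before a query is terminated
        },
        {
            "query_group": ["dashboard"],
            "query_concurrency": 10,
            "memory_percent_to_use": 30,
            "max_execution_time": 60000,
        },
        {"query_concurrency": 5},             # default queue for everything else
    ]

    boto3.client("redshift").modify_cluster_parameter_group(
        ParameterGroupName="custom-wlm",
        Parameters=[{
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        }],
    )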

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion and correction.
  1. A Redshift data warehouse has different user teams that need to query the same table with very different query types. These user teams are experiencing poor performance. Which action improves performance for the user teams in this situation?
    1. Create custom table views.
    2. Add interleaved sort keys per team.
    3. Maintain team-specific copies of the table.
    4. Add support for workload management queue hopping.

AWS Redshift Best Practices

AWS Redshift Best Practices

Designing Tables

Sort Key Selection

  • Redshift stores the data on disk in sorted order according to the sort key, which helps the query optimizer determine optimal query plans.
  • If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key.
    • Queries are more efficient because they can skip entire blocks that fall outside the time range.
  • If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.
    • Amazon Redshift can skip reading entire blocks of data for that column. It can do so because it tracks the minimum and maximum column values stored on each block and can skip blocks that don’t apply to the predicate range.
  • If you frequently join a table, specify the join column as both the sort key and the distribution key.
    • Doing this enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
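
A small, hypothetical sketch of the last point (an orders table frequently joined on customer_id, all names made up): the join column serves as both the DISTKEY and the SORTKEY so the optimizer can choose a sort merge join and skip the sort phase.

    # Hypothetical table frequently joined on customer_id.
    orders_ddl = """
    CREATE TABLE orders (
        customer_id BIGINT,
        order_time  TIMESTAMP,
        amount      DECIMAL(12,2)
    )
    DISTKEY (customer_id)
    SORTKEY (customer_id);
    """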

Distribution Style selection

  • Distribute the fact table and one dimension table on their common columns.
    • Your fact table can have only one distribution key. Any tables that join on another key aren’t collocated with the fact table.
    • Choose one dimension to collocate based on how frequently it is joined and the size of the joining rows.
    • Designate both the dimension table’s primary key and the fact table’s corresponding foreign key as the DISTKEY.
  • Choose the largest dimension based on the size of the filtered dataset.
    • Only the rows that are used in the join need to be distributed, so consider the size of the dataset after filtering, not the size of the table.
  • Choose a column with high cardinality in the filtered result set.
    • If you distribute a sales table on a date column, for example, you should probably get fairly even data distribution, unless most of your sales are seasonal.
    • However, if you commonly use a range-restricted predicate to filter for a narrow date period, most of the filtered rows occur on a limited set of slices and the query workload is skewed.
  • Change some dimension tables to use ALL distribution.
    • If a dimension table cannot be collocated with the fact table or other important joining tables, query performance can be improved significantly by distributing the entire table to all of the nodes.
    • Using ALL distribution multiplies storage space requirements and increases load times and maintenance operations.

Other Practices

  • Automatic compression produces the best results
  • COPY command analyzes the data and applies compression encodings to an empty table automatically as part of the load operation
  • Define primary key and foreign key constraints between tables wherever appropriate. Even though they are informational only, the query optimizer uses those constraints to generate more efficient query plans.
  • Don’t use the maximum column size for convenience.

Loading Data

  • You can load data into the tables using the following methods:
    • Using Multi-Row INSERT
    • Using Bulk INSERT
    • Using COPY command
    • Staging tables
  • Copy Command (see the COPY sketch after this list)
    • COPY command loads data in parallel from S3, EMR, DynamoDB, or multiple data sources on remote hosts.
    • COPY loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well.
    • Use a Single COPY Command to Load from Multiple Files
    • DON’T use multiple concurrent COPY commands to load one table from multiple files as Redshift is forced to perform a serialized load, which is much slower.
  • Split the Load Data into Multiple Files
    • divide the data into multiple files of roughly equal size (between 1 MB and 1 GB)
    • number of files to be a multiple of the number of slices in the cluster
    • helps to distribute workload uniformly in the cluster.
  • Use a Manifest File
    • S3 provides eventual consistency for some operations, so it is possible that new data will not be available immediately after the upload, which could result in an incomplete data load or loading stale data.
    • Data consistency can be managed using a manifest file to load data.
    • Manifest file helps specify different S3 locations in a more efficient way than with the use of S3 prefixes.
  • Compress Your Data Files
    • Individually compress the load files using gzip, lzop, bzip2, or Zstandard for large datasets
    • Avoid compression if you have a small amount of data, because the benefit of compression would be outweighed by the processing cost of decompression.
    • If the priority is to reduce the time spent by COPY commands, use LZO compression. On the other hand, if the priority is to reduce the size of the files in S3 and the network bandwidth, use BZ2 compression.
  • Load Data in Sort Key Order
    • Load your data in sort key order to avoid needing to vacuum.
    • As long as each batch of new data follows the existing rows in the table, the data will be properly stored in sort order, and you will not need to run a vacuum.
    • Presorting rows is not needed in each load because COPY sorts each batch of incoming data as it loads.
  • Load Data using IAM role
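
A hedged sketch tying several of the points above together (bucket, cluster, role ARN, and table names are all hypothetical): a manifest lists the split, gzip-compressed files, and a single COPY statement loads them in parallel using an IAM role.

    import json
    import boto3

    # Hypothetical manifest listing the split, compressed load files.
    manifest = {
        "entries": [
            {"url": "s3://my-demo-bucket/load/part-000.gz", "mandatory": True},
            {"url": "s3://my-demo-bucket/load/part-001.gz", "mandatory": True},
        ]
    }
    boto3.client("s3").put_object(
        Bucket="my-demo-bucket",
        Key="load/files.manifest",
        Body=json.dumps(manifest).encode("utf-8"),
    )

    # One COPY command: parallel load, manifest-driven file list, gzip decompression.
    copy_sql = """
    COPY sales
    FROM 's3://my-demo-bucket/load/files.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
    MANIFEST
    GZIP;
    """
    boto3.client("redshift-data").execute_statement(
        ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser", Sql=copy_sql
    )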

Designing Queries

  • Avoid using select *. Include only the columns you specifically need.
  • Use a CASE expression to perform complex aggregations instead of selecting from the same table multiple times (see the sketch after this list).
  • Don’t use cross-joins unless absolutely necessary
  • Use subqueries in cases where one table in the query is used only for predicate conditions and the subquery returns a small number of rows (less than about 200).
  • Use predicates to restrict the dataset as much as possible.
  • In the predicate, use the least expensive operators that you can.
  • Avoid using functions in query predicates.
  • If possible, use a WHERE clause to restrict the dataset.
  • Add predicates to filter tables that participate in joins, even if the predicates apply the same filters.
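
A tiny illustration of the CASE-expression and predicate points (table and column names are hypothetical): a single scan computes several conditional aggregates instead of querying the same table once per condition.

    # One pass over a hypothetical sales table replaces several filtered SELECTs.
    case_sql = """
    SELECT
        SUM(CASE WHEN region = 'US' THEN amount ELSE 0 END) AS us_revenue,
        SUM(CASE WHEN region = 'EU' THEN amount ELSE 0 END) AS eu_revenue,
        COUNT(CASE WHEN amount > 1000 THEN 1 END)           AS large_orders
    FROM sales
    WHERE sale_date >= '2023-01-01';  -- predicate restricts the dataset up front
    """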

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion and correction.
  1. An administrator needs to design a strategy for the schema in a Redshift cluster. The administrator needs to determine the optimal distribution style for the tables in the Redshift schema. In which two circumstances would choosing EVEN distribution be most appropriate? (Choose two.)
    1. When the tables are highly denormalized and do NOT participate in frequent joins.
    2. When data must be grouped based on a specific key on a defined slice.
    3. When data transfer between nodes must be eliminated.
    4. When a new table has been loaded and it is unclear how it will be joined to dimension tables.
  2. An administrator has a 500-GB file in Amazon S3. The administrator runs a nightly COPY command into a 10-node Amazon Redshift cluster. The administrator wants to prepare the data to optimize performance of the COPY command. How should the administrator prepare the data?
    1. Compress the file using gz compression.
    2. Split the file into 500 smaller files.
    3. Convert the file format to AVRO.
    4. Split the file into 10 files of equal size.