• SAS Notes

    **https://www.sascrunchtraining.com*;
    Data Test;
    a = 1;
    Run;

    Data Number;
    Var1 = 123;
    Var2 = 356;
    Var3 = 923;
    Run;

    /*Input variable*/
    Data Food;
    Restaurant = ‘Burger King’;
    NumEmploy = 5;
    Location = ‘Toronto’;
    Run;

    /*Multiple vars*/
    Data SCORE;
    input Score1 Score2 Score3;
    Datalines;
    77 88 35
    93 57 74
    67 85 71
    ;
    Run;

    /*Char vs. Num in datalines*/
    /*example of wrong codes*/
    Data Test;
    Input Name Gender Age;
    Datalines;
    John Male 31
    Mary Female 23
    Mark Male 45
    ;
    Run;
    /*Name and Gender will be NA*/

    /*dolalr sign ($) is needed for creating char vars with datalines*/
    /*dollar signs need to be after the var name*/
    Data Test;
    Input Name $ Gnder $ Age;
    Datalines;
    John Male 31
    Mary Female 23
    Mark Male 45
    ;
    Run;

    Data Profile;
    Input PAT_ID $ Enrol $ Bscore;
    Datalines;
    P001 Yes 99
    P002 Yes 101
    P003 No 125
    ;
    Run;
    /*the input of datalines should straightly be following the data table, not the variable*/

    /*length limit of characters*/
    Data Test;
    Input Name $ Gender $;
    Datalines;
    Christopher Male
    Elizabeth Female
    MacDonald Male
    ;
    Run;

    **If not pre-denfined, the default length of a char var in SAS is 8. Anything longer than that will be cut off.;
    /*Thus in the example above, the names will be trauncated. */
    Data Test;
    Length Name $12;
    Input Name $ Gender $;
    Datalines;
    Christopher Male
    Elizabeth Female
    MacDonald Male
    ;
    Run;
    /*the length definition should be before input statement or it won’t work – think about the sas execution logic*/

    Data Profile;
    Length PAT_ID $10;
    Input PAT_ID $ Enrol $ Bscore;
    Datalines;
    PAT3000001 Yes 99
    PAT3000002 Yes 101
    PAT3000003 No 125
    ;
    Run;
    /*view your output to check, in case you miss out char definition / length definition*/

    /*set dataset – set statement is used to copy a data set into another*/
    Data Test2;
    Set Test;
    Run;
    /*subset a dataset */
    Data Test2;
    Set Test;
    If Gender = ‘Male’;
    Run;
    **You can use both Where and If to subset the dataset. ‘Where’ is more efficient when operating on larger data as it directly subset
    from raw data. ‘If’ will get all the data and then filter. However, ‘If’ can tell you how many variables are in the original dataset
    vs. current data set to give you a sense of how many have been dropped;
    Data Test3;
    Set Test;
    Where Gender = ‘Male’;
    Run;

    Data Enrol;
    Set Profile;
    If Enrol = ‘Yes’;
    Run;

    /*accessing other libraries*/
    Data Cars;
    Set SASHELP.Cars;
    Run;

    Data JanJol;
    Set SASHELP.Holiday;
    If month = 1;
    Run;

    /*accessing your own sas datasets*/
    Libname Proj1 “C:\Users\shirui.zhang\Desktop\GfK Learning\SAS-test\SASUniversityEdition\myfolders”;
    Libname Proj2 “C:\Users\shirui.zhang\Desktop\GfK Learning\SAS-test\SASUniversityEdition\myfolders\DS1”;

    /*copy out a dataset from library*/
    Data Income;
    Set Proj2.Income;
    Run;
    /*save dataset in work to folder*/
    Data Proj1.CP951;
    Set SASHELP.CP951;
    Run;

    /*export sas data file to excel*/
    Proc Export Data = Income
    OutFile = “C:\Users\shirui.zhang\Desktop\GfK Learning\SAS-test\SASUniversityEdition\myfolders\Income.csv”
    Replace
    Dbms = csv;
    Run;
    /*replace option allows you to overwrite the file if it’s already exsisted in the directory path. we should always include whit option*/
    /*dbms specifies the file type for exporting*/
    Proc Export Data = SASHelp.Electric
    OutFile = “C:\Users\shirui.zhang\Desktop\GfK Learning\SAS-test\SASUniversityEdition\myfolders\Electric.csv”
    Replace
    Dbms = csv;
    Run;
    /*need to check why can only export csv but not xls/xslx*/

    /*exporting data to text file*/
    Proc Export Data = Income
    OutFile = “C:\Users\shirui.zhang\Desktop\GfK Learning\SAS-test\SASUniversityEdition\myfolders\Income.txt”
    Replace
    Dbms = tab;
    Run;
    /*for text, it’s xxx.txt dbms = tab*/

    /*export can also be done by right clicking the data file and define the path to export*/

    /*Exercise 1*/
    Data Rate;
    Bank = ‘Bank of America’;
    Rate = 3.23;
    Year = 5;
    Run;
    /*every var need to end with a semicolon*/

    /*Exercise 2*/
    Data Demo;
    Input ID $ Race $ Gender $ Age;
    Datalines;
    APP001 Black Male 30
    APP002 Caucasian Female 25
    APP003 Asian Male 24
    APP004 Black Female 32
    APP005 Caucasian Female 26
    ;
    Run;
    /*Semicolon is needed after Dataline statement*/

    /*Exercise 3*/
    Data Perch;
    Set SASHelp.Fish;
    If Species = ‘Perch’;
    Run;

    /*Coding Project 1*/
    /*data import*/
    Proc Import datafile = ‘C:\Users\shirui.zhang\Desktop\GfK Learning\SAS-test\SASUniversityEdition\myfolders\kpi.txt’
    out = kpi
    dbms = tab
    replace;
    delimiter = ‘ ‘;
    Run;
    /*another way of importing with datalines*/
    Data kpi2;
    Input Store $ Revenue Staff Salary Operation Profit Complaint Turnover;
    Datalines;
    STORE101 128000 18 29200 15200 83600 5 2
    STORE102 158000 17 19000 12000 127000 11 2
    STORE103 138000 18 26300 10500 101200 7 1
    STORE104 101000 17 19700 19700 61600 5 2
    STORE105 123000 15 29500 10400 83100 7 1
    STORE106 189000 13 24400 12600 152000 5 2
    STORE107 135000 10 24800 11900 98300 5 2
    STORE108 130000 14 19400 11000 99600 3 1
    STORE109 191000 12 28300 10500 152200 8 2
    STORE110 176000 10 23500 15900 136600 9 1
    ;
    Run;
    /*subset*/
    Data Gkpi;
    Set Kpi;
    If Revenue / Staff > 10000;
    Run;

    /*math operations*/
    Data Numbers;
    Input Val1 Val2;
    Datalines;
    1 26
    4 49
    17 45
    13 32
    11 22
    34 43
    25 25
    0 43
    19 23
    23 33
    ;
    Run;

    Data Math1;
    Set Numbers;
    Val3 = Val1 + Val2;
    Val4 = Val1 – Val2;
    Run;

    Data Math2;
    Set Numbers;
    Val3 = Val1 * Val2;
    Val4 = Val1 / Val2;
    Run;
    /*ctrl + f4 to close the active window*/

    Data Accounting;
    Input EmployeeID $ Base OTHours Rate;
    Datalines;
    ID0001 6300 4 40
    ID0002 4500 8 35
    ID0003 8800 5 45
    ID0004 8900 4 60
    ID0005 5600 3 55
    ID0006 6000 9 45
    ID0007 7600 2 35
    ;
    Run;

    Data Salary;
    Set Accounting;
    Salary = Base + OTHours * Rate;
    Run;
    /*Sum function in SAS instead of +*/
    Data Numbers;
    Input Val0 Val1 Val2;
    Datalines;
    4 1 26
    1 4 49
    9 . 45
    23 13 .
    48 11 22
    . 34 43
    19 25 25
    7 0 .
    17 10 23
    16 23 33
    ;
    Run;

    Data Math1;
    Set Numbers;
    Val3 = Val0 + Val1 + Val2;
    Run;

    /*When using +, as long as one of the vars has na, the summed value is na.*/
    /*The plus sign fails to add up numbers with the presence of missing value*/
    Data Math2;
    Set Numbers;
    Val3 = Sum (of Val0, Val1, Val2);
    Run;
    /*The sum function will ignore any missing value when adding up the numbers*/

    Data Sales;
    Input StoreID $ Mon Tue Wed Thr Fri;
    Datalines;
    STR001 890 750 760 950 1150
    STR002 650 450 520 760 970
    STR003 1420 1250 1300 1425 1890
    STR004 830 . 710 810 1040
    STR005 760 550 780 980 1230
    STR006 430 530 560 . 970
    STR007 1010 980 . 890 1230
    STR008 440 740 560 640 740
    STR009 670 690 730 950 870
    STR0010 570 790 820 810 640
    ;
    Run;

    Data TotalSales;
    Set Sales;
    TotalSales = Sum (of Mon, Tue, Wed, Thr, Fri);
    Run;

    /*mean, min, max with SAS*/
    Data Numbers;
    Input Val0 Val1 Val2;
    Datalines;
    4 1 26
    1 4 49
    9 . 45
    23 13 .
    48 11 22
    . 34 43
    19 25 25
    7 0 .
    17 10 23
    16 23 33
    ;
    Run;
    Data Math;
    Set Numbers;
    m1 = Mean(val0, val1, val2);
    m2 = Min(val0, val1, val2);
    m3 = Max(val0, val1, val2);
    Run;

    Data Exam;
    Input ID $ Test1 Test2 Final;
    Datalines;
    9000001 90 87 66
    9000002 66 89 98
    9000003 20 45 73
    9000004 75 74 100
    9000005 44 85 76
    9000006 71 82 89
    9000007 82 83 88
    9000008 68 71 54
    9000009 36 48 46
    9000010 71 73 98
    ;
    Run;

    Data FinalScore;
    Set Exam;
    FinalScore = 0.5 * max(Test1, Test2) + 0.5 * Final;
    Run;

    /*Rand function*/
    Data Template;
    do i = 1 to 10;
    output;
    end;
    Run;
    /*create random number*/
    /*uniform distribution – every data point along the line has equal probability of being selected*/
    Data Rand1;
    Set Template;
    Randno = Rand(‘uniform’);
    Run;
    /*normal distribution – second and third parameters are mean and st*/
    Data Rand2;
    Set Template;
    Randno = Rand(‘normal’, 5, 1.2);
    Run;
    /*Bernoulli Distribution – The second parameter is the probability*/
    Data Rand3;
    Set Template;
    Randno = RAND(‘bernoulli’, 0.7);
    Run;
    /*Poisson distribution – second parameter is mean*/
    Data Rand6;
    Set Template;
    Randno = RAND(‘poisson’, 2);
    Run;
    Data Stat;
    Input Player $ Stat $ Current;
    Datalines;
    L.James Points 28.6
    L.James Rebounds 10.3
    L.James Assists 6.4
    L.James Blocks 0.8
    L.James FG 47.7
    ;
    Run;
    /*generate random number within the range of 0.9-1.1 with uniform distribution*/
    Data Template2;
    do i = 1 to 5;
    output;
    end;
    Run;
    /*set the boundries of the output values in 2nd and 3rd parameters*/
    /*apply the radom number on the current stats to get predicted value*/
    Data PredStat;
    Set Stat;
    PredSat = Current * RAND(‘uniform’, 0.9, 1.1);
    Run;

    /*round function*/
    Data Numbers;
    input randno;
    datalines;
    7.820995511
    5.6069808411
    2.6857884419
    8.2832695455
    4.8186799493
    3.1513387109
    11.895538688
    6.9183423654
    5.9589683432
    7.7071493822
    ;
    Run;
    /*round numbers the second parameter is the multiple of number of decimal places*/
    Data Numbers2;
    Set Numbers;
    Randno2 = round(Randno, 0.2);
    Run;
    /*e.g. if just one decimal place then 0.1*/
    Data Numbers2;
    Set Numbers;
    Randno2 = round(Randno, 0.1);
    Run;
    /*e.g. if just round in 100s then then 1000*/
    Data Numbers1;
    Input Revenue;
    Datalines;
    1000320
    2030002300
    3253232
    532150212
    234123342
    ;
    Run;
    Data Numbers2;
    Set Numbers1;
    Randno2 = round(Revenue, 1000);
    Run;

    Data Grocery;
    Input Item Price Discount;
    Datalines;
    1 60.49 10
    2 79.99 13
    3 47.99 7
    4 50.09 0
    5 43.99 5
    6 81.79 7
    7 68.17 3
    8 103.49 10
    9 121.78 8
    10 78.69 8
    ;
    Run;
    Data Grocery2;
    Set Grocery;
    DisPrice = round(Price * (1-Discount/100), 0.01);
    Run;
    /*Floor function – round down numeric values*/
    Data Numbers;
    input randno;
    datalines;
    7.820995511
    5.6069808411
    2.6857884419
    8.2832695455
    4.8186799493
    3.1513387109
    11.895538688
    6.9183423654
    5.9589683432
    7.7071493822
    ;
    Run;
    /*round down the values to whole numbers*/
    Data Numbers2;
    Set Numbers;
    Randon2 = Floor(Randno);
    Run;
    /*round down the values to 1 decimal place – cannot do directly with floor. needs some math tweak*/
    Data Numbers2;
    Set Numbers;
    Randon2 = Floor(Randno*10)/10;
    Run;

    /*Substr function – get substrings from strings or characters*/
    Data PhoneList;
    Input Tel $20.;
    datalines;
    416-222-2222
    905-222-333365
    289-222-4444
    604-333-2121
    780-444-8989
    514-555-59592
    902-333-1414
    204-999-4646
    877-111-1212
    800-222-8686
    ;
    Run;
    /*The second parameter tells SAS where substring starts. If end is not specified, it means get all the way to the end*/
    Data PhoneList2;
    Set PhoneList;
    Ext1 = substr(Tel, 9);
    Run;
    /*If there is a third parameter, it specifies the number of characters to extract*/
    Data PhoneLisr3;
    Set PhoneList;
    Ext2 = substr(Tel, 9, 2);
    Run;
    /*get the first 3 characters – starting position is 1 and no. is 3*/
    Data PhoneLisr4;
    Set PhoneList;
    Ext2 = substr(Tel, 1, 3);
    Run;
    /*Length function – Length() get the number of characters of the variables*/
    Data List;
    Input Name : $20.;
    datalines;
    Anna
    Christopher
    Mary
    Jonathan
    Tim
    Elizabeth
    Amy
    schwarzenegger
    Kevin
    Porter
    ;
    Run;
    Data List2;
    Set List;
    Len = Length(Name);
    Run;

    Data FirstName;
    Input Name $80.;
    datalines;
    Anna
    Christopher
    Mary
    Jonathan
    Tim
    Elizabeth
    Amy
    (*$##*($(#*$(#*$(*#($*#(*$#*$#*$(*#*$#*$#*$*#(*$(#
    schwarzenegger
    Kevin
    Porter
    ;
    Run;
    Data FirstName2;
    Set FirstName;
    Len = Length(Name);
    Run;
    /*in sas, operations are within data tables, you need to assign a name tothe function output*/

    /*Double stroke – combine character variabls Var3 = Var1 || Var2;*/
    Data Text;
    Infile datalines dsd dlm=’,’;
    Length Chr1 $10 Chr2 $50;
    Input Chr1 $ Chr2 $;
    Datalines;
    This is a ,SAS tutorial.
    My name is, Jonathan.
    Today’s to,pic is Combing Character Variables.
    ;
    Run;
    /*use || to combine the texts*/
    /*strings are separated by space*/
    Data Text2;
    Set Text;
    Chr3 = Chr1 || Chr2;
    Run;

    Data Profile;
    input SiteID $ PatientID $;
    datalines;
    S01 991
    S01 992
    S01 993
    S02 991
    S02 992
    S02 993
    S03 991
    S03 992
    S03 993
    S04 991
    ;
    Run;

    Data Profile2;
    Set Profile;
    ID = SiteID || PatientID;
    Run;
    /*Trim function – to remove the trailing space from SiteID*/
    Data Profile3;
    Set Profile;
    ID = Trim(SiteID) || PatientID;
    Run;

    Data FirstLast;
    Input First $ Last $;
    Datalines;
    Justin Jeever
    Amy Chang
    Alex Toby
    Andrew Khan
    MaryAnn Garabed
    Leslie Song
    Jeff Barrett
    Jeev Daha
    Casey Carson
    Ann Chen
    ;
    Run;
    /*apart from characters, symbols and spaces can also be combined using ||. It’s like & in excel*/
    Data FirstLast2;
    Set FirstLast;
    Name = trim(First) ||”, “|| Last;
    Run;

    /*Index function – Var2=Index(Var1,”SAS”);*/
    Data Example;
    Text = “This is a SAS tutorial”;
    Run;
    /*get the starting position of the string. spaces also count*/
    Data Example2;
    Set Example;
    Ind = Index(Text, “SAS”);
    Run;
    /*If the word is not found, the index function will return to the value of 0*/
    Data Example3;
    Set Example;
    Ind = Index(Text, “training”);
    Run;
    /*the mathing is case sensitive. If not strictly matches, it will still return to 0*/
    Data Example4;
    Set Example;
    Ind = Index(Text, “sas”);
    Run;

    Data MathExam;
    infile datalines dsd;
    input Comment $40.;
    datalines;
    Mary’s exam result is 89
    Jonathan’s exam result is 77
    Ann’s exam result is 97
    Christopher’s exam result is 68
    Peter’s exam result is 90
    Jeev’s exam result is 55
    Tina’s exam result is 86
    Porter’s exam result is 91
    MaryAnn’s exam result is 69
    Tim’s exam result is 72
    ;
    Run;
    /*use index + substr to extract substrings*/
    /*extract the results from string using index and substr*/
    Data MathExam2;
    Set MathExam;
    Ind = Index(Comment, “t is”)+5;
    Result = substr(Comment, Ind, 2);
    Drop Ind;
    Run;
    /*I did not use “is” as index as the name Christeper also contains “is”*/

    /*indexc function – similar to index function. searches for individual characters instead of the whole text*/
    /*i.e. does the text contain any character of the string*/
    /*Var2=Indexc(Var1,”!@#”);*/
    /*it’s ususally used to clean the data*/
    Data Name;
    Input FirstName $20.;
    Datalines;
    Abraham
    Jonathan
    MaryAnn
    Christopher
    Benjamin
    Michael
    Terry###
    Anderson
    Paul
    Porter
    ;
    Run;
    Data Name2;
    Set Name;
    Ind = Indexc(FirstName, “#@$%”);
    Run;
    /*The outcome returns to 0 or the starting position of the matching character*/

    /*Find the special character as the key to position the result number*/
    Data Mathexam3;
    Set Mathexam;
    Ind = Indexc(Comment, “‘”)+ 18;
    Result = substr(Comment, Ind, 2);
    Drop Ind;
    Run;

    /*upcase, lowcase and propcase functions*/
    /*Var2=UPCASE (Var1); Var3=LOWCASE (Var1); Var4=PROPCASE(Var1);*/
    /*Propcase mean cap in every first character*/
    Data Example;
    Comment = “This is an apple.”;
    Comment2 = Upcase(Comment);
    Comment3 = Lowcase(Comment);
    Comment4 = Propcase(Comment);
    Run;
    /*this is useful when we need to index something. As index function is case sensitive*/

    Data Symptoms;
    Infile datalines dsd dlm=”~”;
    Length Sickness $15 Symptoms $100;
    Input Sickness $ Symptoms $;
    Datalines;
    Nausea~headache,vomiting,Dizziness,lightheadedness,Diarrhea,abdominal pain
    Asthma~Coughing,shortness of breath,Chest Tightness,Wheezing
    Diabetes~Weight change,Extreme FATIGUE,Unusual thirst,frequent urination
    Hepatitis~fatigue,poor appetite,belly Pain,Mild fever
    ;
    Run;
    /*to find the keywords in the text chunks that have both uppper cases and lower cases, it’s better standadise the texts first*/
    Data Symptoms2;
    Set Symptoms;
    Symptoms = Lowcase(Symptoms);
    Ind = Index(Symptoms, “fatigue”);
    Run;

    /*rename, keep and drop statements*/
    /*RENAME NAM1 = NAM2; KEEP Var1 Var2 Var3; DROP Var4 Var5 Var6;*/
    /*Rename refers to rename variables*/
    Data Profile;
    Input First_Name $ Last_Name $ Participants_age;
    Datalines;
    Porter Williams 28
    Amy Battison 43
    John Lim 22
    Casey Walson 27
    Black Chen 42
    Dominic Abraham 24
    Mary Katy 18
    Tina Kuo 59
    Timothy Latino 45
    Nigel Benjamin 34
    ;
    Run;

    Data Profile2;
    Set Profile;
    Rename First_Name = First
    Last_Name = Last
    Participants_age = Age;
    Run;
    /*rename is one statement no matter how many vars are renamed – don’t add separate ; */
    Data Profile3;
    Set Profile2;
    Keep First Age;
    Run;
    Data Profile4;
    Set Profile2;
    Drop Last;
    Run;
    /*Keep and drop can also be added at set stage which can imcrease the efficiency and save space when operating on big data*/
    Data Profile3;
    Set Profile2 (Keep = First Age);
    Run;
    Data Profile4;
    Set Profile2 (Drop = First Age);
    Run;
    Data Margarin;
    Set SASHELP.Margarin;
    Keep HouseID Choice Brand;
    Rename HouseID = ID;
    Run;
    /*If else statements*/
    /*IF Var1=1 Then Var2=2; ELSE Var2=4;*/
    Data Exam;
    Input StudentNo $ MidTerm Final;
    Datalines;
    STU001 56 98
    STU002 34 76
    STU003 89 93
    STU004 66 77
    STU005 74 47
    STU006 63 79
    STU007 80 82
    STU008 29 74
    STU009 69 70
    STU010 92 99
    ;
    Run;
    /*If and Else are separate statements*/
    Data Exam2;
    Set Exam;
    If MidTerm > 50 and Final > 50 Then Result = “Pass”;
    Else Result = “Fail”;
    Run;

    /*Do statement – it allows you to perform multiple actions within a if-then statement*/
    /*put the actions between do and end*/
    Data Exam3;
    Set Exam;
    If Final > 50 then do;
    Result = “Pass”;
    Mark = 0.5*MidTerm + 0.5*Final;
    End;
    Run;
    /*Do statement starts with a do and you can then execute as many as actions as you want within the do statement. */
    /*Finally you close the statement with an end statement*/
    Data Cars;
    Set SASHELP.Cars;
    If MSRP <40000 Then Class = “Economy”;
    Else Class = “Luxury”;
    Run;

    /*Coding exersice*/
    /*Exersice 1 – Calculate Phone Bills with consideration of extra costs*/
    Data Phonebill;
    Input BillNo $ Minutes Messages Data;
    Datalines;
    B001 238 15 670
    B002 323 39 1300
    B003 89 2 230
    B004 124 22 890
    B005 156 34 1290
    B006 122 22 1180
    B007 211 26 450
    B008 189 28 990
    B009 123 23 220
    B010 101 7 330
    ;
    Run;
    /*Phonebill exercise – you need to think through the problem before writing the code.*/
    /*e.g. how to caputure the extra minutes/message */
    /*My answer*/
    Data Phonebill2;
    Set Phonebill;
    If Minutes > 200 Then ExtraMin = Minutes – 200; Else ExtraMin = 0;
    If Messages > 25 Then ExtraMsg = Messages – 25; Else ExtraMsg = 0;
    If Data > 1000 Then ExtraData = Data – 1000; Else ExtraData = 0;
    If Minutes <= 200 and Messages <= 25 and Data <= 1000 Then Bill = 30;
    Else Bill = 30 + 0.1*ExtraMin+ 0.5*ExtraMsg+ 0.05*ExtraData;
    Run;
    /*Model answer*/
    /*It uses a Max function to avoid using the if else statement. Bascially similiar approach*/
    /*A good way of using Max is to use a variable against a benchmark / baseline / 0*/
    Data PhoneBill3;
    Set PhoneBill;
    Base = 30;
    Add_Min = Max(Minutes-200, 0);
    Add_Meg = Max(Messages-25, 0);
    Add_Dat = Max(Data-1000, 0);
    Amount = Base + Add_Min*0.1 + Add_Meg*0.5 + Add_Dat*0.05;
    Run;
    /*Exersice 2 – Get average*/
    Data Race;
    Input Car $ Lap1 Lap2 Lap3 Lap4 Lap5;
    Datalines;
    CAR001 390.31 410.9 460.8 380.56 470.27
    CAR002 410.19 440.99 390.61 400.14 430.62
    CAR003 420.86 410.68 400.61 390.29 460.16
    CAR004 390.65 380.87 430.19 380.82 470.54
    CAR005 420.62 430.82 430.15 400.95 430.33
    ;
    Run;
    Data Race2;
    Set Race;
    Avg = mean(Lap1, Lap2, Lap3, Lap4, Lap5);
    Run;
    /*Exersice 3 – Generate random index with uniform distribution and get only index 4 claims*/
    Data Expense;
    do i = 101 to 200;
    Claim = “CLAIM” || put(i, 3.0);
    output;
    end;
    drop i;
    Run;
    /*My answer uses round(), rand() and if*/
    Data Expense2;
    Set Expense;
    Index = Round(Rand(‘uniform’, 1, 10));
    If Index = 5;
    Run;
    /*Model answer uses floor(), rand() and if*/
    Data Expense3;
    Set Expense;
    RandInt = Floor(Rand(‘uniform’)*10) + 1;
    If RandInt = 5;
    Run;
    /*Rand(‘uniform’) itself is 0-1 when the boundry is not set*/

    /*Exersice 4 – Get the age at marridge*/
    Data Date;
    Input ID $ BirthDate MarriageDate;
    Format BirthDate MarriageDate yymmdd10.;
    Datalines;
    ID001 4423 19215
    ID002 3234 19216
    ID003 3242 19215
    ID004 5124 19217
    ID005 4921 19216
    ID006 2772 19215
    ID007 4212 19216
    ID008 3821 19215
    ID009 2412 19216
    ID010 4124 19217
    ;
    Run;
    Data Date2;
    Set Date;
    Age = Floor((MarriageDate – BirthDate)/365.25);
    Run;
    /*We need to use Floor to round down in this case, as unless the birth day is past, we consider the person still at a smaller age.*/

    /*Exersice 5 – Separate First and Last names*/
    Data Name;
    Infile Datalines dlm=”,”;
    Length Name $30;
    Input Name $;
    Datalines;
    Nancie Strohl
    Janee Salvaggio
    Annabelle Schwanke
    Marybelle Holtsclaw
    Art Dahlen
    Barrie Bianco
    Barbera Deltoro
    Launa Gaudreau
    Isabelle Tiernan
    Annamae Bellin
    ;
    Run;

    /*My answer: Index the space and substr names*/
    Data Name2;
    Set Name;
    Ind = Indexc(Name, ” “);
    FirstName = substr(Name, 1, Ind-1);
    LastName = substr(Name, Ind+1);
    Run;

    /*Exersice 6 – Classify the products based on the first digit*/
    Data Product;
    Input Barcode $16.;
    Datalines;
    584563449723
    584563358735
    384563675036
    084563154684
    084563447166
    984563804834
    284563674476
    884563657341
    784563480621
    384563604852
    ;
    run;
    /*Need to be careful about indent for datalines, the spaces might be counted as length*/
    /*My answer – If-then function, substr function*/
    Data Product2;
    Set Product;
    CatInd = substr(Barcode, 1, 1);
    If CatInd = 2 Then Cat = “Meats, Fruits, Vegetables”;
    If CatInd = 3 Then Cat = “Drug”;
    If CatInd = 4 Then Cat = “Loyalty Card”;
    If CatInd in (5, 9) Then Cat = “Coupon”;
    If CatInd in (0, 1, 6, 7, 8) Then Cat = “General”;
    Run;
    /*CatInd in (1, 2, 4, 5, 6) can be used, same as in SQL*/

    /*Model answer – if d1 = 5 or d1 = 9 is the same as if d1 in (5, 9)*/
    Data Product3;
    Set Product;
    Length class $30;
    d1 = substr(barcode, 1, 1);
    if d1 = 2 then class = “Meat, Fruits, Vegetables”;
    else if d1 = 3 then class = “Drug”;
    else if d1 = 4 then class = “Loyalty Card”;
    else if d1 = 5 or d1 = 9 then class = “Coupon”;
    else if d1 in (0, 1, 6, 7, 8) then class = “General”;
    Run;

    /*Exersice 7-8 extract other info from barcode*/
    Data Product4;
    Set Product;
    Manufacture = substr(barcode, 2, 5);
    Run;

    Data Product5;
    Set Product;
    Product = substr(barcode, 7, 5);
    If Product = 67503;
    Run;
    /*Exersice 9 check whether the last digit (check digit) is correct*/
    Data Product6;
    Set Product;
    /*Last Digit*/
    d12 = substr(barcode, 12, 1);
    /*Check Digit Calculations*/
    d1=substr(barcode, 1, 1);
    d2=substr(barcode, 2, 1);
    d3=substr(barcode, 3, 1);
    d4=substr(barcode, 4, 1);
    d5=substr(barcode, 5, 1);
    d6=substr(barcode, 6, 1);
    d7=substr(barcode, 7, 1);
    d8=substr(barcode, 8, 1);
    d9=substr(barcode, 9, 1);
    d10=substr(barcode, 10, 1);
    d11=substr(barcode, 11, 1);
    /*Calculation*/
    odd = sum(of d1, d3, d5, d7, d9, d11)*3;
    even = sum(of d2, d4, d6, d8, d10);
    check = 10 – mod(odd+even, 10);
    /*Check if Last Digit is the same as Check Digit*/
    if d12 ^= check then flag = 1;
    Run;
    /*can I use the do / loop logic for it? i*2-1 for odd and i*2 for even*/
    /*Yes, you can use do statement or array*/

    /*Code Project for Module 3*/
    /*use %include to run a file of long codes – this can be used to run stored procedures*/
    %include ‘C:\Users\xxxLearning\SAS-test\SASUniversityEdition\myfolders\coding_exercise_(code).txt’;
    /*Perform data checks on the dataset*/
    /*Instead of print distinct values to check, use some calculations and assumptions to validate*/
    /*Numeric errors: Abnormal age, income and spend*/
    Data CkAge;
    Set Customer;
    Age = (DOS – DOB) / 365.25;
    If Age > 140 Or Age < 0;
    Comment = “Abnormal Age Value ” || Compress(Age) ||””;
    Variable = ‘Age’;
    Value = put(Age, best.);
    Run;
    /*make reasonable assumptions based*/
    Data CkIncome;
    Set Customer;
    If Income > 500000 Or Income < 0;
    Comment = “Negative Income Value ” || Compress(Income) ||””;
    Variable = ‘Income’;
    Value = put(Income, best.);
    Run;
    /*Wrong spend is based on the assumption of “it should not be more than 3 times of income”*/
    Data CkSpend;
    Set Customer;
    If Spend > 3*Income;
    Comment = “Abnormal Income Value (” || Compress(Spend) ||”)”;
    Variable = ‘Spend’;
    Value = put(Spend, best.);
    Run;
    /*CHAR errors: special characters in name, wrong custid formats, wrong occup input*/
    /*Use indexc to inspect special characters or even characters*/
    Data CkCustID;
    Set Customer;
    Lg = Length(CustID);
    Ind = Indexc(CustID, “abcdefghijklmnopqrstuvwxyz,.!/@#$%^&*()_+?}-{<“);
    If Lg ^= 8 Or Ind ^=0;
    Comment = “Special Char in CustID (” || Compress(CustID) ||”)”;
    Variable = ‘CustID’;
    Value = CustID;
    Run;

    /*check whether the name strings contain special characters – input errors*/
    Data CkNames;
    Set Customer;
    Ind1 = Indexc(First, “!@#$%^&*()<>?{}:1234567890”);
    Ind2 = Indexc(Last, “!@#$%^&*()<>?{}:1234567890”);
    If Ind1 ^= 0 Or Ind2 ^=0;
    Comment = “Special Char in Last Name (” || Compress(Last) ||”)”;
    Variable = ‘Last’;
    Value = Last;
    Run;
    /*- or / is allowed in occup*/
    Data CkOccup;
    Set Customer;
    Ind = Indexc(Occup, “!@#$%^&*()<>?{}:1234567890”);
    If Ind ^= 0;
    Comment = “The Occup (” || compress(Occup) || “) contains invalid character(s).”;
    Variable = ‘Occup’;
    Value = Occup;
    Run;
    /*check categorical variables*/
    Data CkGender;
    Set Customer;
    If Gender ^= ‘Female’ and Gender ^= ‘Male’;
    Comment = “Unknown Gender (” || compress(Gender) || “)”;
    Variable = ‘Gender’;
    Value = Gender;
    Run;
    Data CkEdu;
    Set Customer;
    If Edu not in (1, 2, 3, 4);
    Comment = “Undefined Edu Level (” || compress(Edu) || “)”;
    Variable = ‘Edu’;
    Value = put(Edu, best.);
    Run;
    /*although Edu has been recoded in Texts in format statement, it still should be indexed with raw values*/
    Data CkStatus;
    Set Customer;
    If Status not in (‘Married’, ‘Single’, ‘Divorced’);
    Comment = “Unkown Marridge Status (” || compress(Status) || “)”;
    Variable = ‘Status’;
    Value = Status;
    Run;
    /*Need to check the trauncating issue – use inform*/
    Data ErrorReport (Keep = CustID Variable Value Comment);
    Retain CustID Variable Value Comment;
    Length Variable $30 Value $30 Comment $500;
    Set CkAge CkIncome CkSpend CkCustID CkNames CkOccup CkGender CkEdu CkStatus;
    Run;
    /*Use retain statement to re-order the columns. Retain needs to be before Set*/
    Proc Export Data = ErrorReport
    OutFile = ‘C:\Users\shirui.zhang\Desktop\GfK Learning\SAS-test\SASUniversityEdition\myfolders\Error Report.csv’
    Replace
    Dbms = csv;
    Run;

    **Somethings to take note of when appending:*;
    /*The values are in different types by default as some are numeric and some are characters*/
    /*Thus we need to transfrom numebers in numeric forms into char by using put, or else they can’t be appended*/
    /*Another thing is that if length is not pre-defined, the char length of the variables will take the length of the first one*/
    /*Thus the latter ones could be truncated. We need to define the length before Set Statement*/
    /*Although Edu have been recoded in texts, the error ones are not coded, so they are still in numeric forms. Their data types need to be changed*/
    /*Own test – find a word that starts with t, ends with t and has a t in the middle*/

    /*Model Answer*/
    **The model answer does everything in one shot – it uses if then do – end logic to select and combine all the errors within */
    one data step**;

    Proc Import datafile = ‘C:\Users\shirui.zhang\Desktop\GfK Learning\SAS-test\SASUniversityEdition\myfolders\word.txt’
    out = word
    dbms = tab
    replace;
    delimiter = ” “;
    Run;
    Data Word;
    Set word;
    Rename _2 = Words;
    Run;
    /*create vars and check*/
    Data Word2;
    Set Word;
    Lg = Length(Words);
    FL = substr(Words, 1, 1);
    LL = substr(Words, Lg, 1);
    Run;
    /*Filter and select*/
    Data Word3;
    Set Word2;
    Mid = substr(Words, 2, Lg-2);
    If Lowcase(FL) = ‘t’ and Lowcase(LL) = ‘t’ and Index(Lowcase(Mid), ‘t’)^=0;
    Run;
    /*In the end, the answer is Teapot, lol*/

    **6 attributes of SAS variable*;
    **-name -lable -length -type -format -informat*;
    *It is estimated that 30% of the programming error are related to incorrect variable attributes.;
    Data Fish;
    Set SASHelp.Fish;
    Run;
    **Specis has name: Species, label: n.a. type: text length: 9 format and informat: n.a.;
    /*We can go properties of the dataset, go to columns and click in the variable to check its attributes*/

    /*Another way to review is via proc contents*/
    /*Proc Contents to view variable attributes*/
    Proc Contents Data = SASHelp.Fish;
    Run;
    /*In Alphabetic List of Variables and Attributes table, you will be able to find the variable attributes info*/
    Data Cars;
    Set SASHelp.cars;
    If Enginesize > 4;
    Run;
    **Variable naming rule*;
    /*no spaces, no special symbols except for “_”, no longer than 32 chars, no starting with numbers */

    **Variable labels are optional and it describes the variable*;
    Data Cars;
    Set SASHelp.cars;
    Label EngineSize = “The size of the engine”;
    Run;
    /*by using this label statement, the label of the variable will be added/updated*/
    /*it’s usually used as addtional information for third party – can use up to 256 chars*/

    Data Admission;
    Input Applicant $ Crit1 $ Crit2 $ Crit3 $;
    Datalines;
    P00851 Met Met Met
    P00852 Met NotMet Met
    P00853 Met Met Met
    P00854 NotMet NotMet NotMet
    P00855 Met Met NotMet
    P00856 Met Met Met
    P00857 NotMet Met Met
    P00858 Met NotMet Met
    P00859 Met Met Met
    P00860 Met Met Met
    ;
    Run;

    Data Admission;
    Set Admission;
    Label Crit1 = “Applicant’s GPA above 3.5”
    Crit2 = “Applicant’s did not fail more than 4 courses”
    Crit3 = “Applicant’s has a Bachelor of Science degree”
    ;
    Run;
    /*Only one label statement is needed for all the labels added*/
    Data Number;
    Input Age Amount Date;
    Format Date yymmdd10.;
    Datalines;
    41 3409 18478
    37 2036 18946
    48 3763 20288
    58 2815 19732
    59 4891 20616
    26 2185 20687
    46 3362 20514
    64 4653 19907
    60 3878 20384
    61 3307 20386
    ;
    Run;

    Proc Contents Data = Number;
    Run;

    /*Missing value*/
    /*Missing value can be put as a dot in datalines*/
    Data Number2;
    Input Age Amount Date;
    Format Date yymmdd10.;
    Datalines;
    41 3409 18478
    37 2036 18946
    . 3763 20288
    58 2815 19732
    59 4891 20616
    26 2185 20687
    46 . 20514
    64 4653 19907
    60 3878 20384
    61 3307 20386
    ;
    Run;
    /*When we check the data types, we can see that money amount with dollar sign, date, missing value can all be kept under numeric variable*/

    Data Char;
    Infile Datalines DLM=”,”;
    Length Name $8 ID $8 Comment $13 Descrip $10 Age $5;
    Input Name $ ID $ Comment $ Descrip $ Age $;
    Datalines;
    John,ID001,This is John,Male,36
    Mary,ID002,This is Mary,Female,19
    Anton,ID003,This is Anton,Male,28
    Chris,ID004,This is Chris,Female,48
    ;
    Run;
    /*In this example, the age is formatted in char. Char data type can capture both numbers and texts*/
    /*However, it’s a good programming practice to keep the numeric values under numeric variabl.*/
    **Data truncation is a very common problem with character variable. Thus we need to constantly check the df created in case of errors;

    **The length of a variable is the number of bytes assigned to the variable regardless numeric or char vars;
    **3 bytes of numbers is not 3 diguts, the largest integer 3 bytes can caputure is 8,192, largest number 8 bytes can capture is 9,007,199,254,740,990 which is big enought;
    **A numeric variable cannot have a length greater than 8;
    **while it’s a different thing for char vars, length 8 can only capture 8 characters;
    Data Char;
    Input Name $ Gender $;
    Datalines;
    Christopher M
    Elizabeth F
    Kanjirathinkal M
    ;
    Run;
    /*As shown in the example, data truncation takes place a lot for char vars. Thus we need to use length statement to assign a new length for them*/
    Data Char2;
    Length Name $15;
    Input Name $ Gender $;
    Datalines;
    Christopher M
    Elizabeth F
    Kanjirathinkal M
    ;
    Run;
    /*Length should be before the Input statement*/
    Data PhoneBook;
    Length PhoneNo $15;
    Input Name $ PhoneNo $ Gender $;
    Datalines;
    Jason 123-456-7890 Male
    Benjamin 987-654-3210 Female
    ;
    Run;

    /*Data Format*/
    **Variable formats change the display of the data. It masks the data so that it’s different than the value stored in the system;
    Data Display;
    Input Num1 Num2;
    Datalines;
    10001 10001
    10002 10002
    10003 10003
    10004 10004
    10005 10005
    10006 10006
    10007 10007
    10008 10008
    10009 10009
    10010 10010
    ;
    Run;
    /*In this program, we set the format of Num2 variable to dollar 10. This format means that the number now includes a dolalr sign and the comma separator*/
    Data Display2;
    Set Display;
    Format Num2 Dollar10.;
    Run;
    /*In it’s dataset properties, the format will also be displayed as dollar10.*/
    **Variable format only changes the display of the data, not the internal data stored in the system;
    /*i.e. in the example above, Num1 and Num2 are the extact same data*/
    /*Numeric data can also be displayed as date/time format*/
    Data Display3;
    Set Display;
    Format Num2 yymmdd10.;
    Run;
    Data Display4;
    Set Display;
    Format Num2 time8.;
    Run;

    Data Transac;
    Input Order $ Date Time Total;
    Datalines;
    BA00001 19000 40143 1432
    BA00002 19000 51865 1455
    BA00003 19000 68954 2435
    BA00004 19000 49865 894
    BA00005 19000 53214 1745
    BA00006 19000 64521 997
    BA00007 19000 74521 562
    BA00008 19000 56321 132
    BA00009 19000 51236 987
    BA00010 19000 45698 562
    ;
    Run;
    /*In this example, format is used to get the proper display of data*/
    Data Transac2;
    Set Transac;
    Format Date yymmdd10. Time time8. Total dollar10.;
    Run;
    /*The differences between figures stored in numeric vs. char format*/
    Data DollarAmt;
    Input D1 D2 $;
    Format D1 dollar9.0;
    Datalines;
    10000 $10,000
    20000 $20,000
    30000 $30,000
    40000 $40,000
    50000 $50,000
    60000 $60,000
    70000 $70,000
    80000 $80,000
    90000 $90,000
    100000 $100,000
    ;
    Run;
    /*In this example, D1 is in dolalr10. type in numeric format while D2 is in char format. They look the same but only D1 allows numeric data handling and summary.*/
    /*Thus when needed, you should always use numeric var with a relevant format as opposed to using a char value. You can always change the format of display for better viewing of info*/

    /*The format of W.D.*/
    /*W defines the digits for display the full number including the digits*/
    /*D defines the number of decimal places*/

    Data Num;
    Input Num1 Num2;
    Datalines;
    12345.67 12345.67
    34561.11 34561.11
    21165.35 21165.35
    13216.31 13216.31
    61343.11 61343.11
    15654.61 15654.61
    21313.24 21313.24
    44131.31 44131.31
    94543.13 94543.13
    48634.31 48634.31
    ;
    Run;
    Data Num2;
    Set Num;
    Format Num2 10.4;
    Run;
    /*10 is the number of digits of the data string (including the decimal dot itself*/
    /*One thing must be beard in mind is that no matter how the format changes, the data does not change*/
    Data RNumber;
    Input Num1;
    Datalines;
    23483.8492342
    432.2348934
    5489345.23423
    432.53242
    98349.2343
    45893495.2343
    45489.3454
    6845984.234
    58498349.543
    25484.23545
    ;
    Run;
    Data RNumber1;
    Set RNumber;
    Format Num1 13.4;
    Run;
    /*This is how you put display numbers with comma*/
    Data Num;
    Input Num1 Num2;
    Datalines;
    12345.67 12345.67
    34561.11 34561.11
    21165.35 21165.35
    13216.31 13216.31
    61343.11 61343.11
    15654.61 15654.61
    21313.24 21313.24
    44131.31 44131.31
    94543.13 94543.13
    48634.31 48634.31
    ;
    Run;
    Data Num2;
    Set Num;
    Format Num2 Comma.;
    Run;
    /*However, after quickly checking the output, we can see that the decimals are gone*/
    /*Here is a way to keep / re-define the decimals */
    Data WD2;
    Set Num;
    Format Num2 Comma9.2;
    Run;
    /*It’s the same usage as format w.d*/
    /*You can also use dollar w.d. to add dollar sign and decimal places*/
    Data Transac;
    Input Order $ Date Time Total;
    Datalines;
    BA00001 19000 40143 1432
    BA00002 19000 51865 1455
    BA00003 19000 68954 2435
    BA00004 19000 49865 894
    BA00005 19000 53214 1745
    BA00006 19000 64521 997
    BA00007 19000 74521 562
    BA00008 19000 56321 132
    BA00009 19000 51236 987
    BA00010 19000 45698 562
    ;
    Run;
    Data Transac2;
    Set Transac;
    Format Total Dollar5. Total_After_tax Dollar8.2;
    Total_After_tax = Total * (1-0.1475);
    Run;
    **When adding dollar sign and comman, you need to count in the digits for dollar sign ($) and comma (,). If the digits are not counted in
    dollar sign or last decinal numbers will be removed;

    /*SAS Dates*/
    **SAS date starts at January 1, 1960. It’s represented by integer 0.
    **Jan 2, 1960 is represented by 1
    -1000 = April 6, 1957;
    **SAS uses this system – number indexing to keep track of dates, instead of 01JAN2018 or 2018-01-01;

    /*SAS Time*/
    **Each integer represents 1 second
    0 = 0:00:00
    360 = 0:06:00
    3600 = 1:00:00;

    /*Here is an example of how to conver the dates in number forms to be displayed in date format*/
    Data DFormat;
    Input Date;
    Datalines;
    -130001
    10002
    10003
    21500
    40005
    50006
    10007
    10008
    10009
    90010
    ;
    Run;
    Data DFormat2;
    Set DFormat;
    Format Date YYMMDD10.;
    Run;
    **DDMMYY10. is a very common date display type;
    Data DenAppt;
    Input Date Time NumDays;
    Format Date yymmdd10. time time5.;
    Datalines;
    20100 30000 30
    20110 40000 60
    20115 46000 14
    20103 36000 90
    20120 42000 60
    20117 44000 30
    20111 38000 14
    20109 37000 21
    20102 33000 28
    ;
    Run;
    Data DenAppt2;
    Set DenAppt;
    Format NextAppt yymmdd10.;
    NextAppt = Date + NumDays;
    Run;
    /*The format of NextAppt needs to be defined to be displayed in the same way as Date*/
    **Some other date formates used in SAS: yymmdd10. Date9. Date7. mmddyy10.;
    /*Here is an example to view these date formats*/
    Data DFmt;
    Input Date1 Date2 Date3 Date4;
    Datalines;
    12001 12001 12001 12001
    12002 12002 12002 12002
    12003 12003 12003 12003
    12004 12004 12004 12004
    12005 12005 12005 12005
    12006 12006 12006 12006
    12007 12007 12007 12007
    12008 12008 12008 12008
    12009 12009 12009 12009
    12010 12010 12010 12010
    ;
    Run;
    Data DFmt2;
    Set DFmt;
    Format Date1 yymmdd10. Date2 Date9. Date3 Date7. Date4 mmddyy10.;
    Run;
    **yymmdd10. = 1992-11-09 mmddyy10. = 11/9/1992 Date9. = 09NOV1992 Date7. = 09NOV92;
    /*The yymmdd10. and mmddyy10. has – and / respectively while Date7. and Date9. have no separator but use three letters as months*/

    **The two common time formats are Time5. and Time8.;
    **Time5. = 12:56 Time8. = 12:56:28;
    Data TFmt;
    Input Time1 Time2;
    Datalines;
    36001 36051
    39001 39001
    42001 42001
    45001 45001
    48001 48001
    51001 51051
    54001 54001
    57001 57001
    60001 60001
    63001 63001
    ;
    Run;
    Data TFmt2;
    Set TFmt;
    Format Time1 Time5. Time2 Time8.;
    Run;
    /*The Time8. won’t be the rounded Time5., it’s simply Time8. without seconds i.e. Time8. 10:50:59 = Time5. 10:50 not 10:51*/
    Data Attendance;
    Input EmpID $ Date In Out;
    Datalines;
    ID501 20010 28800 61200
    ID502 20010 30400 63200
    ID503 20010 29800 61800
    ID504 20010 33600 61700
    ID505 20010 27800 60200
    ID506 20010 29400 61900
    ID507 20010 29300 61850
    ID508 20010 31600 62930
    ID509 20010 32400 62300
    ID510 20010 31200 61590
    ;
    Run;
    Data Attendance2;
    Set Attendance;
    Format Date yymmdd10. In Time8. Out Time8. Late $7.;
    If In > 32400 Then Do;
    Late = “Late”;
    LateBy = (In – 32400)/60;
    End;
    Else Do;
    Late = “On Time”;
    LateBy = 0;
    End;
    Run;
    /*To set a time condition for variable, the time/date should still be in system format i.e. 10000/32400 etc. The way to calculate it is
    to use time – midnight time (00:00:00) and convert to seconds*/
    /*If else condition have separate statemetns each ends with “;” thus it’s if xxx; else yyy;
    to have multiple conditions in one if else logic, use if xxx then do; yyy zzz; end; */
    /*For if then do end logic, there is “;” after then do and each assignment/condition is separated by “;”*/

    /*Create your own format – Proc Format*/
    /*Sometimes we want to change the way how data is displayed. E.g. in the example below, survey outcomes are shown but we don’t know what they mean*/
    /*Thus we update their format for a clearer idea*/
    Data Survey;
    Input Q1 Q2 Q3 Q4;
    Datalines;
    3 1 2 4 5
    1 1 1 1 1
    5 3 5 5 5
    4 5 3 2 2
    1 3 5 4 5
    2 3 5 4 5
    3 4 5 2 3
    2 4 1 3 5
    5 4 4 4 4
    1 1 2 5 4
    ;
    Run;
    /*We will need to use proc format to create a format to the dataset.*/
    Proc Format;
    Value Ques
    1 = “Strongly Disagree”
    2 = “Disagree”
    3 = “Neutral”
    4 = “Agree”
    5 = “Strongly Agree”;
    Run;
    /*With the proc format, format ques is created.*/
    /*When applying the format, we still use format statement in data step as per normal. Just add Ques after the variables that needs to be formatted*/
    /*We only need to add one Ques. after all the variables that need to be formatted*/
    /*. is needed after all the formats e.g. dollar10.2, Ques. comma.*/
    Data Survey2;
    Set Survey;
    Format Q1 Q2 Q3 Q4 Ques.;
    Run;
    /*Now the display shows “agree/strongly agree” instead of 4/5*/

    **for categorical variables, it’s a good practice to keep them in numeric variables with a custom format
    because it’s easier to do analysis and operations and it helps the variable to aviod the character trauncation issue;
    Data Profile;
    Length ID Gender Race $20.;
    Input ID Gender $ Race $;
    Datalines;
    ID001 Male Caucasian
    ID002 Female Asian
    ID003 Female Black
    ID004 Male Asian
    ID005 Male Caucasian
    ID006 Female Caucasian
    ID007 Male Black
    ID008 Female Black
    ID009 Male Caucasian
    ID010 Male Caucasian
    ;
    Run;
    /*You can add multiple formats in one proc format step*/
    Proc Format;
    Value Gender
    1 = “Male”
    2 = “Female”;
    Value Race
    1 = “Caucasian”
    2 = “Asian”
    3 = “Black”;
    Run;
    Data Profile;
    Input ID $ Gender Race;
    Format Gender Gender. Race Race.;
    Datalines;
    ID001 1 1
    ID002 2 2
    ID003 2 3
    ID004 1 2
    ID005 1 1
    ID006 2 1
    ID007 1 3
    ID008 2 3
    ID009 1 1
    ID010 1 1
    ;
    Run;
    /*Previously, the Gender and Race are text. Now they are numeric with text custom format.*/
    ** the data types are set upon dataset creation. It can’t be alter;

    /*(Best.) Format*/
    **The (Best.) format displays the data as the internal value stored in SAS;
    Data DateTime;
    Input Date Time;
    Format Date yymmdd10. time time5.;
    Datalines;
    18100 30540
    18210 40980
    18315 41230
    18403 34650
    18320 48970
    18217 45550
    18111 31100
    18009 37900
    18502 56400
    ;
    Run;
    /*In this example, the internal values of date and time are in numeric forms but have been formatted to be displayed as date and time formats.*/
    /*Best. can be used to remove special formats from a variable*/
    Data DateTime2;
    Set DateTime;
    Format Date Time Best.;
    Run;
    /*When a dataset has already been formatted, Best. can be used to convert it back to internally stored values – without any mask*/
    Proc Format;
    Value Ques
    1 = “Strongly Disagree”
    2 = “Disagree”
    3 = “Neutral”
    4 = “Agree”
    5 = “Strongly Agree”;
    Run;
    Data Survey;
    Input Q1 Q2 Q3 Q4;
    Format Q1 Q2 Q3 Q4 Ques.;
    Datalines;
    3 1 2 4 5
    1 1 1 1 1
    5 3 5 5 5
    4 5 3 2 2
    1 3 5 4 5
    2 3 5 4 5
    3 4 5 2 3
    2 4 1 3 5
    5 4 4 4 4
    1 1 2 5 4
    ;
    Run;
    /*Now I have a Survey dataset with clearly coded variables. However, if I want to do subset operations, it cannot be on formatted values.*/
    /*Thus I need to check back the original values*/
    Data Survey2;
    Set Survey;
    Format Q1 Q2 Q3 Q4 Best.;
    Run;
    /*From that I can see that if I want to choose all “Strongly Agree” on Q3, I should use 5 (by comparing Survey and Survey 2).*/
    Data Survey3;
    Set Survey;
    If Q3 = 5;
    Run;

    /*Summary of variable formats*/
    **It’s commonly used on numeric variables;
    **W.D format allows you to control how many decimal places to have;
    **Comma. Dollar. allow you to add “,” or “$”;
    **Create your own format use Proc Format;
    **Use Best. to get original formats;

    /*Put function*/
    **Converts a numeric variable to a character varaible;
    Data Conv;
    Input Num1 Num2;
    Datalines;
    1201.1234 20001
    1303.1234 20002
    1405.1234 20003
    1507.1234 20004
    1609.1234 20005
    1202.1234 20004
    1304.1234 20003
    1406.1234 20002
    1508.1234 20001
    1610.1234 20000
    ;
    Run;
    Data Conv2;
    Set Conv;
    Char1 = put(Num1, best.);
    Run:
    /*In this transofmation, Char1 HAS identical information as Num1, however, it’s a char var*/
    **The Put function has two parameters: the variable to be converted and the format to use for the conversion;
    /*In this example, best. is used as the format for the transformation and it tells SAS to convert the values based on the internal values*/
    /*You can also convert the number with another format*/
    Data Conv3;
    Set Conv;
    Char1 = put(Num1, dollar10.2);
    Run;
    /*This time the data is still in char form but with dollar sign and decimal places*/
    /*It can also be converted to a char var with date format*/
    Data Conv4;
    Set Conv;
    Char1 = put(Num1, yymmdd10.);
    Run;
    **It’s important to be aware of the differences of formatting a variable and converting a variable using the PUT function;
    **Formatting does not change the type of the varaible and put/input does;

    /*Use the conversion to put numeric into char date*/
    Data Trtment;
    Input PATID $ Infdt TrtDt;
    Format Infdt Trtdt Date9.;
    Datalines;
    PAT1001 20100 20102
    PAT1002 20101 20102
    PAT1003 20103 20102
    PAT1004 20102 20102
    PAT1005 20100 20102
    PAT1006 20101 20102
    PAT1007 20103 20102
    PAT1008 20100 20102
    PAT1009 20102 20102
    PAT1010 20100 20102
    ;
    Run;
    Data Trtment2;
    Set Trtment;
    Infdt1 = put(Infdt, mmddyy10.);
    Trtdt1 = put(Trtdt, mmddyy10.);
    If Infdt > Trtdt
    Then Do;
    Comment = “Patient ” || Compress(PATID) || ” received his/her first treatment on ” || Compress(Infdt1) || “However, the treatment is received prior to the date of informed consent ” || Compress(Trtdt1) ||””;
    End;
    Run;
    /*Compress statement is used to trim any spaces to ensure that there is no extra space. You can directly put “Patient ” PATID ” received his/her first treatment”*/

    /*Informat*/
    **Infomrt is mostly used when reading in an external file or converting a char var to a numeric var.;
    Data Char;
    Length Date $12 Time $8 Amount $15;
    Input Date $ Time $ Amount $;
    Datalines;
    2015-01-01 8:00:00 $10,100
    2015-01-02 8:05:00 $9,500
    2015-01-03 8:05:00 $12,600
    2015-01-04 8:10:00 $12,800
    2015-01-05 8:15:00 $13,100
    2015-01-06 8:05:00 $11,700
    2015-01-07 8:03:00 $15,900
    2015-01-08 8:02:00 $15,400
    2015-01-09 8:04:00 $14,200
    2015-01-10 8:05:00 $13,400
    ;
    Run;
    /*Input function*/
    /*In this dataset, all variables are in char forms but they can be converted to numeric forms*/
    Data Char2;
    Set Char;
    Date2 = Input(Date, yymmdd10.);
    Run;
    /*Date is Char var but it converted to numeric in Date2 variable*/
    **Same as put function, the input parameters are also 1.variable 2.informat;
    Data Char3;
    Set Char;
    Time2 = Input(Time, Time8.);
    Run;
    Data Char4;
    Set Char;
    Amount2 = Input(Amount, dollar8.);
    Run;
    /*Should use dollar8. to convert dollar amount*/
    **Here is a list of appropriate informat;
    *Pure numbers – Best.
    Dollar Aount – DollarW.D
    Date – yymmdd10. / mmddyy10. / date9. / date7.
    Time – time8. / time5.;
    *Informat used should correspond to the current char format;

    /*Invalid data error*/
    **When using the input funtion, you might encounter a note from the log saying “invalid argument to function input at
    Date = 2015- Date2=. _ERROR_ = 1 _N_ = 8;
    *This is usually due to an incomplete data point from the variable e.g. 2015- as comapred to 2015-10-11;
    *It’s not an error technically, but it’s a good practice to check back in the data to see what’s the issue as it happens often;

    **Format is to only change the display of the data but informat is to indeed change the data type;

    /*Exersice 1*/
    /*Create labels and custom formats for variables*/
    Data Survey;
    Input Q1 Q2 Q3;
    Datalines;
    1 3 5
    2 4 3
    1 2 1
    5 5 5
    4 5 3
    4 3 3
    2 3 1
    1 5 5
    2 4 5
    4 3 5
    ;
    Run;
    Data Survey2;
    Set Survey;
    Label Q1 = “Customer service representatives (CSRs) are well supervised.”
    Q2 = “Customer service representatives (CSRs) adhere to professional standards of conduct.:”
    Q3 = “Customer service representatives (CSRs) act in my best interest.”
    ;
    Run;
    /*Labels can be checked in properties – Columns – double click*/
    Proc format;
    Value Scale
    1 = “Strongly Disagree”
    2 = “Somewhat Disagree”
    3 = “Neither Agree nor Disagree”
    4 = “Somewhat Agree”
    5 = “Strongly Agree”
    ;
    Run;
    Data Survey3;
    Set Survey;
    Format Q1 Q2 Q3 Scale.;
    Run;
    **There is always a “.” after format! for custom format, we also need a “.” to tell SAS that’s it’s a format or SAS will see it
    as an uninitialised variable;
    /*The way to create formats and apply on dataset is to firstly create the format using proc format and then apply the format in data step*/;
    /*Proc format itself is independent from the dataset*/
    /*Model answer combined the label and format in the second data step after the custom formats are created*/

    /*Exersice 2*/
    /*Convert char vars to numeric vars and assign proper formats*/
    Data Listing;
    Length Adv_ID DateList NumMth Price $15;
    Input Adv_ID $ DateList $ NumMth $ Price $;
    Datalines;
    K9933-00019 01MAR2015 3 $650,000
    K9933-00020 15FEB2015 2 $380,000
    K9933-00021 12APR2015 4 $890,000
    K9933-00022 09MAY2015 3 $890,000
    K9933-00023 22FEB2015 2 $450,000
    K9933-00024 08APR2015 2 $360,000
    K9933-00025 03MAR2015 5 $440,000
    K9933-00026 28FEB2015 3 $860,000
    K9933-00027 19JAN2015 4 $740,000
    K9933-00028 01JAN2015 1 $980,000
    ;
    Run;
    /*Informat to convert*/
    Data Listing2;
    Set Listing;
    DateList2 = input(DateList, Date9.);
    NumMth2 = input(NumMth, Best2.);
    Price2 = input(Price, dollar8.);
    Drop DateList NumMth Price;
    Run;
    **Instead of re-write the variable, you can create new ones and drop old ones;
    Data Listing3;
    Set Listing2;
    Format DateList2 Date9. NumMth2 Best2. Price2 dollar8.;
    Run;
    /*Need to make sure the length setting in the informat is correct or conversion might be wrong
    e.g. 100000000 becomes 10000000*/
    /*Convert first, format later*/

    /*Exercise 3 – proc contents display info: check number of vars and obs*/
    Proc contents data = SASHelp.JAC;
    Run;
    /*Exercise 4 – calculate fahrenheit temp and display in 1 decimal place*/
    data temp;
    input hour temp;
    datalines;
    1 19.3
    2 18.1
    3 19.6
    4 20.7
    5 22.3
    6 21.4
    7 19.8
    8 18.1
    9 16.8
    10 17.1
    11 17.2
    12 15.6
    ;
    run;
    Data Temp2;
    Set Temp;
    Format TempFa 4.1;
    TempFa = Temp * 9/5 +32;
    Run;
    /*To define the decimal of a numeric variable, simply add W.D fllowed by the var name*/

    /*Data Manipulation*/
    /*In general data manipulation makes up 60% of SAS program*/

    /*Proc sort to sort the dataset*/
    Data Drinks;
    Infile Datalines dsd;
    Input Type $ Name:$50. Calories Fat;
    Datalines;
    Cold,Caramel Frappuccino® Light Blended Beverage,100,0
    Cold,Coffee Frappuccino® Blended Beverage,180,2.5
    Cold,Teavana® Shaken Iced Passion Tango™ Tea,0,0
    Cold,Shaken Tazo® Iced Black Tea Lemonade,100,0
    Cold,Nonfat Iced Vanilla Latte,120,0
    Cold,Nonfat Iced Caramel Macchiato,140,1
    Cold,Iced Coffee (with classic syrup),60,0
    Cold,Iced Skinny Latte,70,0
    Hot,Brewed Coffee,5,0
    Hot,Brewed Tea,0,0
    Hot,Nonfat Green Tea Latte,210,0
    Hot,Nonfat Cappuccino,60,0
    Hot,Nonfat Caffè Latte,100,0
    Hot,Nonfat Caramel Macchiato,140,1
    Hot,Soy Chai Tea Latte,180,2
    Hot,Nonfat Caffè Mocha – hold the whip,170,2
    Hot,Skinny Vanilla Latte,100,0
    Hot,Steamed Apple Juice,170,0
    Hot,Caffè Americano,10,0
    ;
    Run;
    Proc Sort Data = Drinks;
    By Name;
    Run;
    Proc Sort Data = Drinks;
    By Calories;
    Run;
    Proc Sort Data = Drinks;
    By Descending Calories;
    Run;
    /*In proc sort, the descending needs to be in front of the variable and with full spelling*/
    **The DESCENDING option applies only to the variable immediately after the option.;
    /*Sort by two layers – Type then Fat*/
    Proc Sort Data = Drinks;
    By Type Fat;
    Run;

    /*Creating output dataset*/
    Proc Sort Data = Drinks Out = SortedDrinks;
    By Calories;
    Run;
    /*In this program, a separate output is generated so that the raw data is not overwritten*/
    /*It’s a good practise to keep the original data unchanged and update the output table.*/

    /*Where statement*/
    Proc Sort Data = Drinks Out = HotDrinks;
    By Calories;
    Where Type = “Hot”;
    Run;
    /*In here only ‘where’ can be used, not ‘if’ as ‘If’ can only be used in data step but not proc steps*/

    /*Check what’s the most expensive cars among the five brand*/
    /*First. last. varaibles */
    Proc Sort Data = SASHelp.Cars Out = SortedPrice;
    By Make Descending MSRP;
    Where Make in (“BMW”, “Audi”, “Lexus”, “Mercedes-Benz”, “Porsche”);
    Run;
    Data ExpCars;
    Set SortedPrice;
    By Make;
    If First.Make;
    Run;
    **First. / Last. varaibles capture the first / last observation (which usually represents extreme values) in a sorted group
    The first. / last. variable involved needs to be sorted as well as the KPI variable.
    In this case, the first. variable (the group) is Make and the KPI variable is MSRP
    Need to think these through before writing the commands;

    /*Remove duplicate observations with NODUPKEY*/
    Data Supermarket;
    Infile datalines dsd;
    Input Product : $20. Price DemandPerWeek ;
    Datalines;
    Campbell Soup,3.99,150
    Lay’s Chip,2.99,300
    Kinder Chocolate,5.99,50
    Nestle Ice cream,6.99,80
    Maxwell Coffee,5.99,90
    Coca cola,5.99,300
    Pringles,2.99,200
    Pringles,2.99,200
    Lipton Milk Tea,3.99,150
    Flamingo Fried Chicken,8.99,60
    Dempster’s Bread,1.99,450
    ;
    Run;
    Proc Sort Data = Supermarket Out = Supermarket2 NODUPKEY;
    By Product Price DemandPerWeek;
    Run;
    /*In this example, the duplicated entry is removed in supermarket2 and supermarket stay unchanged.*/
    /*The log will tell you that 1 obs with duplicate key value is deleted*/
    /*the NODUPKEY should be used with caution and you should double check the duplicate key before removing them from the dataset*/

    Data Income;
    Input HouseholdID $ NumMembers HomeOwner $ Income $;
    Datalines;
    HID1001 4 Yes >120000
    HID1002 3 No <120000
    HID1003 2 Yes >120000
    HID1004 6 Yes >120000
    HID1004 6 Yes >120000
    HID1005 6 No <120000
    HID1006 4 Yes >120000
    HID1007 5 Yes <120000
    HID1008 3 Yes >120000
    HID1009 5 Yes <120000
    HID1010 4 No >120000
    ;
    Run;
    /*You need to remove the duplicate key with proc sort*/
    Proc Sort Data = Income Out = Income2 NODUPKEY;
    By HouseholdID NumMembers HomeOwner Income;
    Run;
    /*It’s highly recommended to create a new data set when removing the duplication, to keep original one intact.*/

    /*Flagging Extreme Values – highest and lowest using first. last.*/
    Data Exam;
    Input Subject $ Student $ Results;
    Datalines;
    Math Mary 78
    Math John 67
    Math Tom 98
    Math Chris 56
    Math Amy 89
    English Mary 74
    English John 79
    English Tom 88
    English Chris 92
    English Amy 45
    History Mary 32
    History John 96
    History Tom 55
    History Chris 67
    History Amy 86
    ;
    Run;
    Proc Sort Data = Exam;
    By Subject Results;
    Run;
    Data Exam2;
    Set Exam;
    By Subject Results;
    If first.subject then i = 1;
    else if last.subject then i = 2;
    Run;
    /*Multiple if then conditions can be used with if then else if then else if then else*/
    /*In data step, ‘by’ is equivalant to ‘group by’ in sql*/
    **There is ‘by’ statement in sort as well as in the data step with first. / last.;
    **The ‘by’ statement in data step is important but often neglected. Without the ‘by’ statement the flagging won’t work;

    /*Data set concatenation – Append datasets with same format and variables with Set statement*/
    Data List1;
    Input Sku $ Product $ QTY Price;
    Datalines;
    S1290001 Laptop 50 979
    S1290002 Laptop 30 632
    S1290003 Desktop 100 1299
    S1290004 Desktop 80 1599
    S1290005 Laptop 10 2999
    ;
    Run;

    Data List2;
    Input Sku $ Product $ QTY Price;
    Datalines;
    S1290006 Printer 300 229
    S1290007 Printer 400 467
    S1290008 Printer 100 899
    S1290009 Cable 1200 8.99
    S1290010 Cable 900 7.99
    ;
    Run;
    /*for datafiles with same var names from structutred database, you can append then with Set. e.g. the gse data files*/
    Data List;
    Set List1 List2;
    Run;

    Data P2;
    Set SASHelp.PRDSAL2;
    Run;
    Data P3;
    Set SASHelp.PRDSAL3;
    Run;
    /*Check the total number of obs and separated ones from the log*/
    Data Ptotal;
    Set P2 P3;
    Run;

    **Two common issues when concatenating data sets
    1. Inconsistent variable name
    2. Data Truncation;

    /*Example: Concatenating the data sets with same format but different var names*/
    Data Report1;
    Input PatientID $ Gender $ Age;
    Datalines;
    P0001 M 32
    P0002 F 28
    P0003 M 48
    P0004 F 39
    P0005 M 25
    ;
    Run;
    Data Report2;
    Input PID : $12. Gender $ Age;
    Datalines;
    P00000000001 M 32
    P00000000002 F 28
    P00000000003 M 48
    P00000000004 F 39
    P00000000005 M 25
    ;
    Run;
    Data Report;
    Set Report1 Report2;
    Run;
    **In this example, the columns with the same var names are concastinated while the vars with same info but different names
    are stored in separate rows;
    **To resolve this issue, the names need to be unified. This can be done using the RENAME function;
    Data Report;
    Set Report1 Report2(Rename=(PID = PatientID));
    Run;
    /*Now we check the data, seems like the columns are apended properly.*/
    /*Oh wait, there is a warning in the log saying that multiple lengths are specified and might cause truncation*/
    /*We check back the data and we found it’s true that the PatientID originally from Report2 has been trauncated*/
    /*The length is 8 instead of the 12 defined*/
    **In Set statement that apends the data, the length of the variables are based on the first data set.;
    /*Thus a Length statement needs to be used to define the correct length*/
    Data Report;
    Length PatientID $12.;
    Set Report1 Report2 (Rename = (PID = PatientID));
    Run;
    **In summary, these two issues can be solved with Rename and Length statements;

    data desks1;
    infile datalines dsd;
    length Brand $10 category $12;
    input Brand $ Category $ Price Width Depth Height;
    datalines;
    MICKE,Desk,59.99,73,50,75
    BEKANT,Desk,179.00,140,60,65
    MICKE,Work station,149.00,105,50,140
    VITTSJÖ,Laptop table,39.99,100,36,74
    BESTÅ BURS,Desk,199.00,120,40,74
    ;
    Run;

    data desks2;
    infile datalines dsd;
    length Product $12 Type $28;
    input Product $ Type $ Price WGT DPT HGT;
    datalines;
    HEMNES,Desk with add-on unit,449.00,155,65,137
    MALM,Desk with pull-out panele,179.00,151,65,73
    IKEA PS 2014,Standing laptop station,69.99,45,34,96
    VITTSJÖ,Laptop stand,24.99,55,65,35
    MICKE,Desk with integrated storage,99.00,120,50,75
    ;
    Run;
    /*You can do multiple rename statements.*/
    Data Desks;
    Length Brand $12 category $28;
    Set desks1 desks2(Rename = (Product = Brand Type = Category WGT = Width DPT = Depth HGT = Height));
    Run;

    /*Merge data sets*/
    Data Profile;
    Input Student $ Gender $ Grade;
    Datalines;
    SID0001 M 7
    SID0004 F 7
    SID0002 F 7
    SID0003 M 8
    SID0007 M 6
    SID0005 F 9
    SID0006 F 9
    SID0009 M 7
    SID0008 M 7
    SID0010 F 10
    ;
    Run;

    Data Exam;
    Input Student $ Result;
    Datalines;
    SID0009 70
    SID0001 78
    SID0004 99
    SID0002 90
    SID0003 81
    SID0008 81
    SID0005 78
    SID0006 66
    SID0007 34
    SID0010 66
    ;
    Run;
    **The data sets need to be sorted by merging key before they can be merged;
    **Always sort before using the merge statement;
    /*The three statement could not be run alone if the data sets are not sorted by the key*/
    Proc Sort Data = Profile;
    By Student;
    Run;

    Proc Sort Data = Exam;
    By Student;
    Run;

    Data Combined;
    Merge Profile Exam;
    By Student;
    Run;

    **Two steps merging the data sets;
    **Step 1: sort each of the input data sets by the merging variable(s);
    **Step 2: Merge the data sets by the merging variables;

    Data Cust;
    Input ID Zip ;
    Datalines;
    9000135 14304
    9000134 14301
    9000138 14308
    9000136 14308
    9000137 14304
    9000142 14304
    9000139 14300
    9000140 14301
    9000141 14308
    ;
    Run;

    Data Purchase;
    Input ID Purchase : dollar5.0;
    Datalines;
    9000135 720
    9000133 1002
    9000134 569
    9000140 890
    9000141 780
    9000136 109
    9000137 30
    9000138 4566
    9000139 50

    ;
    Run;

    Proc Sort Data = Cust;
    By ID;
    Run;
    Proc Sort Data = Purchase;
    By ID;
    Run;

    Data PurchaseRecord;
    Merge Cust Purchase;
    By ID;
    Run;

    /*Identify the merging variable and rename them if they are not in the same name*/
    Data Population;
    Input Country $ City: $30. Population;
    Datalines;
    Canada Toronto 6000000
    Canada Montreal 4000000
    Canada Vancouver 2400000
    US Chicago 2700000
    US New_York 8400000
    US Los_Angeles 3800000
    Mexico Mexico_City 8500000
    Mexico Cancun 620000
    ;
    Run;

    Data Uber;
    Input Country $ Cities: $30. NumDriver;
    Datalines;
    US Chicago 20000
    US New_York 14000
    US Los_Angeles 16000
    Canada Toronto 13000
    Canada Montreal 5000
    Mexico Mexico_City 20000
    Mexico Cancun 11000
    Canada Vancouver 7000
    ;
    Run;
    Proc Sort Data = Population;
    By Country City;
    Run;
    Proc Sort Data = Uber;
    By Country Cities;
    Run;
    /*We need to rename the variable name to ensure the consistent names for merging*/
    Data UberData;
    Merge Population Uber(Rename = (Cities = City));
    By Country City;
    Run;
    Data PopulationPerDriver;
    Set UberData;
    PPD = Population / NumDriver;
    Run;
    Proc Sort Data = PopulationPerDriver;
    By PPD;
    Run;

    **What happens if there are unmatched cases from either/both side of the data sets?;
    **The data will still get merged, the unmatched parts will be replaced with ‘.'(numeric) or ‘ ’(char);
    /*It’s the same logic as append */
    Data DS1;
    Input ID Gender $;
    Datalines;
    9001 Male
    9002 Female
    9003 Male
    9004 Male
    9005 Female
    ;
    Run;

    Data DS2;
    Input ID Income;
    Datalines;
    9003 50000
    9004 60000
    9005 75000
    9006 65000
    9007 45000
    ;
    Run;
    Proc Sort Data=ds1;
    By id;
    Run;
    Proc Sort data=ds2;
    by id;
    run;
    Data ds;
    Merge ds1 ds2;
    By id;
    Run;

    **How to identify the unmatched observations;
    **Use if then logic and In function;
    Data ds;
    Merge ds1 (In = a)ds2 (In = b);
    By id;
    If a and b Then i = 1;
    Else If a Then i = 2;
    Else If b Then i = 3;
    Run;

    Data Transac;
    Input Order $ Province : $30. Total;
    Datalines;
    BA00001 Ontario 1432
    BA00002 Ontario 1455
    BA00003 Ontario 2435
    BA00004 Quebec 894
    BA00005 Quebec 1745
    BA00006 Quebec 997
    BA00007 Alberta 562
    BA00008 Alberta 132
    BA00009 Alberta 987
    BA00010 Manitoba 562
    ;
    Run;

    Data Tax;
    Input Province : $30. Tax;
    Datalines;
    Ontario 13
    Quebec 14.975
    Alberta 5
    Manitoba 5
    ;
    Run;

    Proc Sort Data = Transac;
    By Province;
    Run;

    Proc Sort Data = Tax;
    By Province;
    Run;

    Data TaxCal;
    Merge Transac Tax;
    By Province;
    AfterTax = Total * (1+Tax/100);
    Run;

    Data ds;
    Merge ds1 (In = a)ds2 (In = b);
    By id;
    If a and b Then i = 1;
    Else If a Then i = 2;
    Else If b Then i = 3;
    Run;

    /*All can be matched from both sides as i = 1*/
    Data Ck;
    Merge Transac (In = a) Tax (In = b);
    By Province;
    If a and b Then i = 1;
    Else If a Then i = 2;
    Else If b Then i = 3;
    Run;

    /*(In =) Option*/
    Data DS1;
    Input ID Gender $;
    Datalines;
    9001 Male
    9002 Female
    9003 Male
    9004 Male
    9005 Female
    ;
    Run;

    Data DS2;
    Input ID Income;
    Datalines;
    9003 50000
    9004 60000
    9005 75000
    9006 65000
    9007 45000
    ;
    Run;

    Proc Sort Data=ds1;
    By id;
    Run;

    Proc Sort Data=ds2;
    By id;
    Run;

    /*Although a and b are mentioned in the data step, they are not visible in the dataset*/
    data ds;
    Merge ds1 (In=a) ds2 (In=b);
    By id;
    Run;

    /*Since a and b are mentioned in data step, they can be referred in your program*/
    /*It can be used to create varaibles*/
    Data ds;
    Merge ds1(In = a) ds2 (In = b);
    By id;
    Origds1 = a;
    Run;
    /*Var Origds1 will be flaged as 1 if true, 0 if false*/

    /*Transposing data set with proc transpose*/
    **Trasposing data set is a common data manipulation task;
    Data Rev;
    Input Month $ Revenue;
    Datalines;
    Jan 253000
    Feb 460000
    Mar 365000
    ;
    Run;
    /*Id is the row to be and Var is the ob to be*/
    Proc Transpose Data = Rev Out = t_Rev;
    Id Month;
    Var Revenue;
    Run;
    **Three parts of the Proc Transpose:
    1.Out Option: Creates the transposed dataset. If out if not defined, the transposed data set will not be output/saved.
    2.ID Option: Specifies the variable to be transposed as the column headers (rows)
    3.Var Statement: Specifies the results to be transposed;

    Data TEMP;
    Input Hours $ Temperature;
    Datalines;
    H0 15
    H1 13
    H2 13
    H3 14
    H4 14
    H5 14
    H6 13
    H7 14
    H8 14
    H9 14
    H10 14
    H11 15
    H12 16
    H13 17
    H14 16
    H15 16
    H16 16
    H17 15
    H18 15
    H19 14
    H20 14
    H21 14
    H22 14
    H23 15
    ;
    Run;

    Proc Transpose Data = Temp Out = t_Temp;
    ID Hours;
    Var Temperature;
    Run;
    /*Use Sum(of V1, V2, V3) to sum the variables*/
    Data CleanTemp (Drop = _NAME_);
    Set t_Temp;
    Bf12pm = Sum(of H0, H1, H2, H3, H4, H6, H6, H7, H8, H9, H10, H11);
    Af12pm = Sum(of H12, H13, H14, H15, H16, H17, H18, H19, H20, H21, H22, H23);
    Run;

    **When reverse transposing a data set, the ID statement is not needed.;
    /*13-18*/
    Data Q2;
    Input Store $ month $ profit;
    Datalines;
    Store1 April 169000
    Store2 April 105000
    Store1 May 104000
    Store2 May 130000
    Store1 June 178000
    Store2 June 145000
    ;
    Run;
    **The problem when transposing this data set is that there are more than 1 observation in each of the ID groups.;
    /*This is an example that leads to an error*/
    Proc Transpose Data = Q2 Out = t_Q2;
    ID Month;
    Var Profit;
    Run;
    /*This cannot be executed because the ID value April occurs twice and it’s not acceptable as ID parameter*/
    /*Now it’s time to introduce By Statement*/
    Proc Sort Data = Q2;
    By Store;
    Run:
    /*This command must be executed first as proc sort should be always used before By is used in other steps
    If the Proc Transpose with By statement are run alone, they cannot be properly executed*/
    /*With only 1 observation per store, SAS is able to transpose the data set without error*/
    Proc Transpose Data = Q2 Out = t_Q2;
    ID Month;
    Var Profit;
    By Store;
    Run;
    /*Now it becomes a crosstab*/

    Data Flight;
    Infile datalines dsd;
    Input From : $30. To : $30. Airline : $30. Price;
    Datalines;
    Toronto,New York,Air Canada,600
    Toronto,New York,American Airline,400
    Toronto,New York,Virgin Airline,700
    Toronto,Calgary,Air Canada,550
    Toronto,Calgary,American Airline,750
    Toronto,Calgary,Air Transat,600
    Toronto,Hong Kong,Air Canada,1950
    Toronto,Hong Kong,Cathay Pacific,1850
    Toronto,Hong Kong,Hainan Airline,1600
    ;
    Run;

    Proc Sort Data = Flight;
    By From To;
    Run;

    Proc Transpose Data = Flight Out = t_Flight;
    ID Airline;
    Var Price;
    By From To;
    Run;

    /*Getting the Min fare should be a separate step*/
    Data Cheapest;
    Set t_Flight;
    MinFare = Min(Air_Canada, American_Airline, Virgin_Airline,Air_Transat,Cathay_Pacific,Hainan_Airline);
    Run;

    /*Reverse Transpose*/
    **Long to wide is transpose and wide to long is reverse transpose;
    **The command is still proc transpose;
    Data Food;
    Input Restaurant : $30. Coffee Muffin Cookies;
    Datalines;
    McDonald’s 1.75 1.5 1.3
    ;
    Run;

    Proc Transpose Data = Food Out = rt_Food;
    Var Coffee Muffin Cookies;
    Run;
    /*This statement will transpose the data set and the restaurant variable is also gone.*/
    **The Var variable lists the variables to be reverse transposed.;
    **The ID statement is only needed when doing stardard transpose and it’s not needed for reverse transpose;

    Data t_q2;
    Input Store $ _name_ $ April May June;
    Datalines;
    Store1 profit 169000 104000 178000
    Store2 profit 105000 130000 145000
    ;
    Run;
    /*Perform a reverse transpose on T_Q2 and combine the profit from April, May, and June back into one column.*/
    Proc Transpose Data = t_q2 Out = rt_q2;
    Var April May June;
    By Store;
    Run;
    /*Think throught what’s the Var and By in the proc transpose*/

    /*Variable _N_ and _ERROR_*/
    Data Temp;
    Input Num1 Num2;
    Datalines;
    100 5
    101 4
    102 3
    103 2
    104 1
    105 0
    106 -1
    107 -2
    108 -3
    109 -4
    110 -5
    ;
    Run;
    **Variable _N_ shows the observation number in the data set.;
    **Similiar to the ‘a’ in (in = a) option, variable _n_ does not exsist in the data set;
    **To really see the variable _N_, you need to copy _N_ into a variable that is visible in the data set;
    Data Temp2;
    Set Temp;
    obsnum = _N_;
    Run;
    **Variable _ERROR_ is another automatic variable built in the data step;
    **It identifies the observation with error;
    Data Temp3;
    Set Temp;
    Num3 = Num1 / Num2;
    obserr = _ERROR_;
    Run;
    **_ERROR_ flags the observation as ‘1’ when there is an error, and ‘0’ when there is not.;
    **You can use _ERROR_ to identify the problematic observation;
    Proc Sort Data = SASHelp.Cars Out = Cars;
    By Descending Horsepower;
    Run;

    Data Top3HP;
    Set Cars;
    If _N_ in (1, 2, 3);
    Run;
    /*Use ‘if’ and not ‘where’ in the above case as where only indexes the variable in the data set*/
    /*In proc statements, only where but not if can be used*/

    /*Retain Statement*/
    Data Revenue;
    Infile datalines dsd;
    Input Week : $10. Revenue;
    Datalines;
    Week 1, 4000
    Week 2, 5000
    Week 3, 8000
    Week 4, 6500
    Week 5, 8900
    Week 6, 4500
    Week 7, 9800
    Week 8, 10000
    Week 9, 6800
    Week 10, 6300
    ;
    Run;
    Data Revenue2;
    Set Revenue;
    If _N_ = 1 Then Total = 0;
    Run;
    /*In this case, the first ob will have the Total var = 0 but the rest have missing value for that var.*/

    /*Now we try using the retain varaible*/
    Data Revenue2;
    Set Revenue;
    Retain Total;
    If _N_ = 1 Then Total = 0;
    Run;
    /*After it’s executed, all other values for Total are filled with 0*/
    **The retain statement retrains the value of ‘0’ from the first ob to the next’
    **The first ob from total is assigned the value of ‘0’. This value retains to the next ob and as a result,
    the entire column of Total consists of ‘0’;

    **The main application of the retain statement is to perform cumulative summation.;
    Data Revenue3;
    Set Revenue;
    Retain Total;
    /*If _N_ = 1 Then Total = 0;*/
    Total = Sum(Revenue, Total);
    Run;
    /*It means that the summation result will be carried over from row to row.*/
    /*If the retain condition is not set in the program above, Total will be equal to Revenue as Total = 0 in the first place*/
    **Note: setting the initial value to the variable TOTAL in this program is actually not required.;

    Data Inventory;
    Infile datalines dsd;
    Input Product : $30. Store : $10. Qty;
    Datalines;
    Billy’s Bookshelf,Store 1, 150
    Billy’s Bookshelf,Store 2, 200
    Billy’s Bookshelf,Store 3, 230
    Billy’s Bookshelf,Store 4, 160
    Billy’s Bookshelf,Store 5, 180
    Billy’s Bookshelf,Store 6, 220
    Billy’s Bookshelf,Store 7, 130
    Billy’s Bookshelf,Store 8, 450
    Billy’s Bookshelf,Store 9, 260
    Billy’s Bookshelf,Store 10, 300
    ;
    Run;

    Data Inventory2;
    Set Inventory;
    Retain Total;
    Total = Sum(Qty, Total);
    Run;
    /*Firstly identify which one is the var to retain and what’s the total variable*/

    /*Assigning initial values with Retain Statement*/
    **There are 3 ways to assign initial values with a Retain statement:
    **Retain Var1 0;
    Data Revenue2;
    Set Revenue;
    Retain Total 0;
    Run;
    **Retain Var1 . (missing value);
    Data Revenue4;
    Set Revenue;
    Retain Total .;
    Run;

    **-Retain Var1;
    /*Retain Statement is to assign the starting value. It’s usually the three mentioned above, it can also be assigned to numbers such as 100*/
    Data Revenue2;
    Set Revenue;
    Retain Total 100;
    Run;
    /*If you assign Retain Var1 without any varaible definition fomula, the variable will not be created*/
    /*See the example below*/
    Data Revenue5;
    Set Revenue;
    Retain Total;
    Run;
    **When computing cumulative summation, assigning initial value is not needed and the summation can be carried out;
    Data Revenue3;
    Set Revenue;
    Retain Total;
    Total = Sum(Revenue, Total);
    Run;
    /*If you in the case above specifying the Total as 100, 100 will be added since the first Total calculation*/
    /*You can see that from the example below: assuming there is an initial inventory of 1500*/
    Data Inventory;
    Infile datalines dsd;
    Input Product : $30. Store : $10. Qty;
    Datalines;
    Billy’s Bookshelf,Store 1, 150
    Billy’s Bookshelf,Store 2, 200
    Billy’s Bookshelf,Store 3, 230
    Billy’s Bookshelf,Store 4, 160
    Billy’s Bookshelf,Store 5, 180
    Billy’s Bookshelf,Store 6, 220
    Billy’s Bookshelf,Store 7, 130
    Billy’s Bookshelf,Store 8, 450
    Billy’s Bookshelf,Store 9, 260
    Billy’s Bookshelf,Store 10, 300
    ;
    Run;
    Data Inventory2;
    Set Inventory;
    Retain Total 1500;
    Total = Sum(Qty, Total);
    Run;

    /*(+) Expression*/
    **Combining the use of the Retain and Sum function allow you to compute the summation across the obs. The ‘+’ is a shortcut to also do that;
    Data Inventory;
    Infile datalines dsd;
    Input Product : $30. Store : $10. Qty;
    Datalines;
    Billy’s Bookshelf,Store 1, 150
    Billy’s Bookshelf,Store 2, 200
    Billy’s Bookshelf,Store 3, .
    Billy’s Bookshelf,Store 4, 160
    Billy’s Bookshelf,Store 5, .
    Billy’s Bookshelf,Store 6, 220
    Billy’s Bookshelf,Store 7, 130
    Billy’s Bookshelf,Store 8, 450
    Billy’s Bookshelf,Store 9, 260
    Billy’s Bookshelf,Store 10, 300
    ;
    Run;
    Data Inventory2;
    Set Inventory;
    Total + Qty;
    Run;
    **The code does the following:
    1. Creates the variable Total with no initial value (missing)
    2. The varaible total has a built-in retain capability. Its value is retained from one ob to the next.
    3. The (+Revenue) syntax adds the revenue to the varaible Total on each ob.

    **The (+) can also work in the case of missing values;

    /*Exercise 1*/
    /*Sort the data set by highest selling price (i.e. HIGH) in descending order*/
    Data Stocks;
    Set SASHelp.Stocks;
    Run;
    Proc Sort Data = Stocks Out = SortedStocks;
    By Stock Descending High;
    Run;

    /*Exercise 2*/
    /*Create a data set that contains the highest and lowest trade volume for each stock.*/
    Proc Sort Data = Stocks Out = RStocks;
    By Stock Volume;
    Run;
    /*Must think through what’s the sort variable and what’s the First. / Last. variable*/
    /*There is a by statement in first./last. operations and proc sort needs to be done before that.*/
    Data MLStocks;
    Set RStocks;
    By Stock Volume;
    If First.Stock or Last.Stock;
    Run;

    /*Exercise 3*/
    /*Find out the difference between the highest and lowest selling price for each stock between January 1, 1990, to December 31, 1999.*/
    Data Ck;
    Length date $10;
    Input date $;
    Datalines;
    1990-01-01
    1999-12-31
    ;
    Run;
    /*to convert to numeric, the parameter should be it’s original format*/
    Data Cvt;
    Set Ck;
    date2 = input(date, yymmdd10.);
    Run;
    /*10958 correspond to January 1, 1990 and 14609 correspond to December 31, 1999*/
    Data DStocks (Keep = Date Stock High Low Diff);
    Set Stocks;
    Diff = High – Low;
    Date2 = Date;
    Where Date >= 10958 and Date <= 14609;
    Run;

    /*Model Answer:*/
    Proc Sort Data=sashelp.stocks out=high;
    by stock high;
    where date>=’01JAN1990’D and date<=’01DEC1999’D;
    run;
    /*To refer to the date, we can simply use date>=’01JAN1990’D*/

    /*Model Answer:*/
    /*I understood wrongly. It’s the highest – lowest in each stock (i.e. brand) not each entry*/
    Proc Sort Data=sashelp.stocks out=high;
    by stock high;
    where date>=’01JAN1990’D and date<=’01DEC1999’D;
    run;
    data high2;
    set high;
    by stock high;
    if last.stock;
    keep stock high;
    run;
    Proc Sort Data=sashelp.stocks out=low;
    by stock low;
    where date>=’01JAN1990’D and date<=’01DEC1999’D;
    run;
    data low2;
    set low;
    by stock low;
    if first.stock;
    keep stock low;
    run;
    data diff;
    merge high2 low2;
    by stock;
    diff = high-low;
    run;
    Proc Sort Data=sashelp.stocks out=high;
    by stock high;
    where date>=’01JAN1990’D and date<=’01DEC1999’D;
    run;
    data high2;
    set high;
    by stock high;
    if last.stock;
    keep stock high;
    run;
    Proc Sort Data=sashelp.stocks out=low;
    by stock low;
    where date>=’01JAN1990’D and date<=’01DEC1999’D;
    run;
    data low2;
    set low;
    by stock low;
    if first.stock;
    keep stock low;
    run;
    data diff;
    merge high2 low2;
    by stock;
    diff = high-low;
    run;

    /*Exercise 4*/
    /*Find out the product that shows the highest return rate across all of the regions.*/
    Data Shoes;
    Set SASHelp.Shoes;
    Rate = Returns / Sales;
    Run;
    Proc Sort Data = Shoes Out = StShoes;
    By Region Rate;
    Run;
    /*There must be a by statement in first. / last. and the by statement is the same as in proc sort in previous step*/
    Data HRateShoes;
    Set StShoes;
    By Region Rate;
    If Last.Region;
    Run;
    Proc Sort Data = HRateShoes Out = HHRate;
    By Descending Rate;
    Run;

    /*Model Answer:*/
    Data Shoes;
    Set sashelp.shoes;
    rate = returns / sales;
    arbi = 1; * Arbitrary variable *;
    Run;
    Proc sort data=shoes;
    by rate;
    Run;
    Data Shoes2;
    Set shoes;
    by arbi rate;
    if last.arbi;
    Run;
    **Please note that an arbitrary variable is needed when there is no grouping variable.;
    /*In this case, we can just use arbitrary varaible to get the highest return rate among
    all regions as highest among all regions -> highest in this data set. We only need to group by regions when we need to find
    the highest in each region*/

    /*Exercise 5*/
    /*How to get subtotal by categories*/
    /*The total bonus to be given out will be 5% of the total sales across all region. Calculate the total bonus to be given out to all regional managers.*/
    /*Use retain and first last*/
    Proc Sort Data = Shoes;
    By Region;
    Run;

    Data TotalR;
    Set Shoes;
    By Region ;
    If First.Region Then TotalR = 0;
    TotalR +Sales;
    Arbi = 1;
    If Last.Region;
    Keep Region Arbi TotalR;
    Run;

    /*Exercise 6*/
    /*Get contribution*/
    Data TotalC;
    Set TotalR;
    By Arbi;
    Retain Total;
    Total = Sum(TotalR, Total);
    If Last.Arbi;
    Keep Region Arbi Total;
    Run;
    /*You need to specify that the merge is by Arbi, or else it will be merged by region*/
    /*If it’s merged by region, only western europe will get merged and the rest will be missing values, while if use arbi, all can be merged*/
    /*The good use of Arbitrary variable*/
    Data Contrib;
    Merge TotalR TotalC;
    By Arbi;
    Contrib = TotalR/Total;
    Run;

    /*Model Answer:*/
    /*The good use of Merge*/
    Data Shoes;
    Set sashelp.shoes;
    arbi = 1;
    Run;

    Data Total;
    Set Shoes;
    By Arbi;
    Total+Sales;
    If last.arbi;
    keep Arbi total;
    Run;

    ** Calculate sales from each region **;
    Proc Sort Data=Shoes;
    By region;
    Run;

    Data TotalR;
    Set Shoes;
    By region;
    if first.region then totalR=0;
    Arbi = 1;
    TotalR+Sales;
    If last.region;
    keep Region Arbi totalR;
    Run;

    ** Calculate percentage contribution and bonus **;
    Data Contri;
    Merge TotalR Total;
    By Arbi;
    ContriPct = TotalR / Total;
    Bonus = ContriPct * Total * 0.05;
    Run;

  • [TTStats] What is a t-test?

    In 2014/2015, the material used to make table tennis balls changed from celluloid to a non-flammable plastic. These balls are a tiny bit bigger than the 40mm celluloid balls, but the main difference is the type of material they are made from. According to the information online, the balls have less spin and bounce after the change of materials.

    As a data person, I want to measure whether there are big differences in the bounciness of the two types of balls. I get a box of plastic balls and another of celluloid balls, both of which have 10 pieces of new balls inside. The way I measure the bounciness is that I let each ball free fall at the height of 300mm onto the table tennis table and measure what’s the height of its first bounce.

    After 20 free falls, I have obtained the following data:

    1 2 3 4 5 6 7 8 9 10
    Plastic 219.831 218.499 219.166 219.5 220.332 218.666 220.333 219.164 220.021 220.332
    Celluloid 227.489 229.833 231.164 228.33 231.152 232.665 227.658 230.332 230.49 229.89

    With these data, we can calculate the average of the height of each type of balls’ first bounces:

    Plastic 219.584mm
    Celluloid 229.9mm

    From the mean figures, we can see that the average bounce height of plastic balls is almost 10mm smaller than celluloid balls. However, what does it really tell us? Does it mean that there is a big difference between the heights of the first bounce of the two types of balls? Is it possible that the 10mm differences are just by chance (e.g. in a new test, plastic balls may have higher first bounce heights than the celluloid balls)? How can we draw the conclusion that there are a big enough differences in the heights that we can prove that the difference in the bouncy level does exist?

    The distribution of the heights is plotted on the picture below. There are obvious differences and the minimum of the celluloid height is already greater than the maximum height of plastic, however, is it conclusive enough?

    In order to figure out this question, we need to introduce the t-test.

    The t-test is a commonly used technique that measures the differences between sample means.  It calculates the differences between the groups and compares it to the differences within the groups

    T= variance between groups/variance within groups

    A big t-value indicates different groups and a small t-value indicates similar groups.

    How do we know a t-value is big enough to decide that the groups are very different? Each t-value corresponds to a p-value which indicates the possibility of ‘there is no real difference between the groups’. A p-value of less than 0.05 says that the probability of two groups being indifferent is less than 5%. The smaller the p-value, the more likely that the two means are different. Usually, we take 5% as a threshold of judging whether the difference is significant.

    Now back to our example, we run a t-test on the data we collected and below are the outputs:

    The p-value is 3.955e-10, which means that there is less than 0.00001% probability that the average bounce heights of plastic balls and celluloid balls are indifferent. This proves that the bounce heights of plastic balls are significantly lower than the celluloid balls and they are much less bouncy.

    The t-test that we use to compare the means between two groups is called an independent t-test. There are also paired t-test and one-sample t-test.

    The paired t-test is to test the means of the same group twice. The advantage of this t-test is that it reduces possibly variability of subjects, however, it’s prone to order effects. e.g. the body temperature of a group of runners before and after a marathon.

    The one-sample t-test is to compare the group mean with a hypothesized/known sample mean. e.g. the average IQ of a group of Cambridge graduate vs. 100

    *The plastic vs. celluloid ball example I gave is just for storytelling purpose. I did not conduct that experiment. The tests and the figures are from the experiment done by Preston Table Tennis Association. If you are interested in the experiment, kindly check out their video: Plastic & Celluloid Table Tennis Balls Compared: Bounce and Conformity

  • STTA Coaching Course – here is the info you need

    I summarised some information about the STTA Coaching Course (ITTF-PTT Level 1 Coaching Course) and compiled it in Q&A formats. For detailed information, check out the STTA information page and Coaching Course introduction.

    When does the course take place?

    -It takes place twice a year and the time is announced and confirmed by STTA. The next one is  16-24th November (TBC)

    Is this course in English or Mandarin?

    -The course is delivered in English and the final assessment needs to be conducted in English.

    How much do I need to pay for the course?

    -$374.50 (payment by cheque)

    Who will be delivering the course?

    -It’s gonna be a qualified International Course Conductor appointed by ITTF

    What’s the size of the class?

    -20 to 30 students

    What’s the venue?

    -STTA Training Hall – 297C Lorong 6 Toa Payoh Singapore 319389

    How many hours do I need to spend to get the Level 1 Certificate?

    -The course length is 30 hours and you need to complete 30 hours of practical coaching to get the certificate. The course takes place in a consecutive 9 days (weekday nights and weekends).

    I cannot make it to all the sessions. Can I skip one or two classes?

    -All sessions are compulsory. You cannot pass the course if you miss any of the session.

    I heard there is a prerequisite test. What skill level do I need to pass the test?

    -You need to have the fundamental skills in table tennis (e.g. attack, drive, footwork) and you will be tested on multi-ball feeding. The test is scheduled months before the course and you need to pass it to sign up for the course. No payment needed if you do not pass the test.

    What’s taught in the course?

    -The course is based on the ITTF-PTT L1 coaching course Syllabus.

    Is the course classroom-based or practical?

    -Theories are taught in the classroom. There may be practical sessions.

    Are there assessments for this course?

    -There is a practical assessment a the end of the course, in which you need to put what you learned into practice.

    Is it possible that I fail the coaching course?

    -Yes, there is a possibility that you fail the coaching course. Not many people fail it so don’t worry about it!

    Feel free to comment to add in more information!

  • DHS Hurricane Rubbers – What are the differences?

    How are Hurricane 2, Hurricane 3, NEO Hurricane 3, Hurricane 8 and Hurricane 3-50 different?

    -Hurricane 2 was developed in 2000 after Wang Liqin and Yan Sen won the gold medal in Olympics men’s doubles. It creates high speed and spin whilst maintaining stable control. In addition, it generates strong near-table offensive power to make the ball fly in a low arc which frustrates opponent’s defences.

    -Hurricane 3 is designed for players who mainly adopt a controlled playing style or have a relatively weak attack. The rubber can add strength and produce a long arc, which increases the ball control and exerts fast attack and loop drive close-to-table.

    -NEO rubbers were designed for the era of inorganic glue. It creates a stable and high-speed arc and improves the scoring ability of the players. NEO Hurricane 3 has a high speed, low fast loop, changeful arc and stronger spin.

    -Hurricane 8 is a new generation of sticky-speed rubber. Hurricane 8 is specially designed to enhance the spin and speed for 40+ Celluloid-Free T.T. balls. It keeps a good balance between flexibility and elasticity.

    -Hurricane 3-50 is made of tacky rubber with elastic sponge. The sponge is receptive to ball contact while the rubber friction grips and holds the ball. Such a combination provides the player with great ball feeling with every shot and suits players who want to exert a fast, stable loop attack.

    (Information above is from DHS official website)

    While unboosted Hurricane 3 can be hard and not bouncy, a nicely boosted Hurricane 3 can have incredible hand feeling, low throw and great control. However, it’s too much of a hassle for many people.

    Neo Hurricane 3 is a bit like boosted H3. It has good spin, control and power but you still need to boost it to maintain the performance after the boosting effect goes off.

    Personally, I prefer Hurricane 8. It is friendly to beginners who want to switch from European/Japanese rubbers to Chinese rubbers as it has good elasticity and flexibility. In a way, it does not require the best strength but it trains you to use your own strength to improve the performance. It a tacky rubber with decent spin and speed. It might be less powerful than boosted Hurricane 3 but it’s good enough to use for beginner / intermediate level players. Most importantly, you don’t have to boost it throughout!

    (Information above is just personal opinion. Feel free to discuss/comment)

  • Best shots of Xu Xin the showman (aka the cloudwalker in table tennis)

    Xu Xin, the left-handed penhold player is famous for his magical shots and amazing footwork. In this video, you can watch Xu Xin’s signature shots (forehand sidespin, lob, forehand attacks, chop block etc) in international competitions as well as in internal training.

    Speaking about his unique style, Xu Xin mentioned that it was developed since he was young and perhaps it’s related to his personality: now matter what sports he is into, he likes the athletes who are artistic and creative.

    This video was posted by 全经典乒乓 on Youtube.

  • What are the weights of Chinese national paddlers’ blades?

    This article is translated from a Chinese Post: 乒乓球拍地板重量该如何选?国手告诉你

    Wang Liqin: I prefer heavier blades of around 95 grams

    Hao Shuai: I like heavier blades of 90-93 grams. It has better hand feel. If I use lighter blades, I may miss the ball sometimes or hit the ball with the edge of the racket.

    Fan Zhendong: I like blades of around 90 grams

    Yan An: My blade is around 88 grams. Heavier blades are more powerful but I am not outstanding in power, thus I choose a lighter one which allows better speed in attacking. It is also quite powerful at back-court.

    Qiu Yike: I prefer it a bit heavier but not too much. Around 96-98 grams. If it is too light, it won’t be powerful enough.

    Xu Xin (penhold): My blade is around 86 grams. I have good strength. People with less strength can choose lighter blades. I choose the heavier one because I feel I can handle it. Heavier blades are more powerful and easier to gain points.

    Chen Meng: I like lighter ones, around 86 grams. It allows me to be faster in forehand and backhand transition.

    Fan Ying (Chopper): I like blades with weights of 86-88 grams. As a chopper, I need to be faster in transition.

    I put players’ faces in the same sequence, for your reference.

  • Table tennis video tutorials for beginners

    Here are some YouTube playlists of table tennis that I found useful as a beginner:

    PingSkills

    My favourite English Channel is PingSkills. It is very beginner-friendly and explains table tennis techniques and rules in a simple and clear way.

    全民学乒乓 横拍篇(Table Tennis for Beginners – Shake-Hands)

    全民学乒乓 直拍篇(Table Tennis for Beginners – Penhold)

    My favourite Mandarin Channel is yunpeng guo’s channel. Guo is the founder of pingpangwang.com and he creates various comprehensive videos for different levels of players. He explains table tennis techniques in great details and he is very spot-on when pointing out the common issues of beginners.

    Here are some tutorials from table tennis world champions with English Subs:

    Shake-Hands Tutorial from Ma Long

    Penhold Tutorial from Wang Hao

  • Rules for equipment and attire in table tennis competitions

    Here is the summary of a few rules for equipment and attire commonly applied in table tennis competitions in Singapore as at the end of 2017.

    1. Three-star size 40+ white balls are used

    2. White attires are not allowed

    3. It is the responsibility of each player to ensure that racket coverings are attached to their racket blade with adhesives that do not contain harmful volatile solvents

    4. Playing clothing shall consist of a short-sleeved or sleeveless shirt and shorts or skirt or one-part sports outfits, socks and playing shoes; other garments, such as part or all of a tracksuit, shall not be worn during play except with the permission of the referee

    These rules are just for your reference and some competition follow stricter rule and standards.

    You may refer to the ITTF Handbook 2017 for the international standards and the complete set of rules.

  • What to bring to my first table tennis meetup?

    Maybe you always have a passion for table tennis but have not played for two or three years.

    Maybe you have not played since secondary school and want to pick it up as a hobby after start working/retirement.

    Now you have signed up for your first table tennis session after a long while. What do you need to bring?

    Here is a checklist based on my personal experiences:

    • Your racket
    • Sportswear including t-shirt, shorts and shoes
    • Water bottle (they usually have water fountain at the venues)
    • Towel
    • White 40+ or D40+ table tennis ball (bring it if you are able to buy it. If not, other players usually have them for you to play/borrow

     

  • Where can I play table tennis in Singapore?

    Here are some popular sports halls for playing table tennis. You can find more details and book them on ActiveSG.

    Bishan Sports Hall

    Choa Chu Kang Sports Hall

    Clementi Sports Hall

    Jurong East Sports Hall

    Yishun Sports Hall

    MOE (Evans) Sports Hall

    Sengkang Sports Hall

    Jurong East Sports Hall

    Jurong West Sports Hall

    Pasir Ris Sports Hall Table Tennis Area

    Our Tampines Hub – Community Auditorium

    There are also places to play table tennis in community centers. You can book on onePA Table Tennis Facility Online Booking System or book over the counter at the CCs.

    Ace the Place CC

    Bedok CC

    Canberra CC

    Chong Pang CC

    Kampong Ubi CC

    Keat Hong CC (register over the counter only)

    Pek Kio CC

    Thomson CC (register over the counter only)

    Toh Payoh Central CC

    Yew Tee CC (register over the counter only)

    If you want to find individual table tennis rooms for multi-ball training, one on one coaching etc., you can check out:

    Serangoon Swimming Complex (4 individual table tennis rooms)

    Toa Payoh South CC (1 table tennis room made of squash court)

    If you love the kampung spirit and want to play with various friendly table tennis enthusiasts, just grab your racket and go:

    Bukit Merah CC (no aircon/3 tables/need to play doubles or wait for your turn during peak hours/free admission/serious players welcomed)

    There are plenty of other places that you can play table tennis other than the places listed above. Just get started from one place and ask the new table tennis buddies you get to know to introduce you the good places that they usually go.
    Now you may wonder, how can I find table tennis buddies?