TTStats Archives - PINGPONG SG

SAS Notes

by pingpongsg on October 2, 2018

**https://www.sascrunchtraining.com*;
Data Test;
a = 1;
Run;

Data Number;
Var1 = 123;
Var2 = 356;
Var3 = 923;
Run;

/*Input variable*/
Data Food;
Restaurant = ‘Burger King’;
NumEmploy = 5;
Location = ‘Toronto’;
Run;

/*Multiple vars*/
Data SCORE;
input Score1 Score2 Score3;
Datalines;
77 88 35
93 57 74
67 85 71
;
Run;

/*Char vs. Num in datalines*/
/*example of wrong codes*/
Data Test;
Input Name Gender Age;
Datalines;
John Male 31
Mary Female 23
Mark Male 45
;
Run;
/*Name and Gender will be NA*/

/*dolalr sign ($) is needed for creating char vars with datalines*/
/*dollar signs need to be after the var name*/
Data Test;
Input Name $ Gnder $ Age;
Datalines;
John Male 31
Mary Female 23
Mark Male 45
;
Run;

Data Profile;
Input PAT_ID $ Enrol $ Bscore;
Datalines;
P001 Yes 99
P002 Yes 101
P003 No 125
;
Run;
/*the input of datalines should straightly be following the data table, not the variable*/

/*length limit of characters*/
Data Test;
Input Name $ Gender $;
Datalines;
Christopher Male
Elizabeth Female
MacDonald Male
;
Run;

**If not pre-denfined, the default length of a char var in SAS is 8. Anything longer than that will be cut off.;
/*Thus in the example above, the names will be trauncated. */
Data Test;
Length Name $12;
Input Name $ Gender $;
Datalines;
Christopher Male
Elizabeth Female
MacDonald Male
;
Run;
/*the length definition should be before input statement or it won’t work – think about the sas execution logic*/

Data Profile;
Length PAT_ID $10;
Input PAT_ID $ Enrol $ Bscore;
Datalines;
PAT3000001 Yes 99
PAT3000002 Yes 101
PAT3000003 No 125
;
Run;
/*view your output to check, in case you miss out char definition / length definition*/

/*set dataset – set statement is used to copy a data set into another*/
Data Test2;
Set Test;
Run;
/*subset a dataset */
Data Test2;
Set Test;
If Gender = ‘Male’;
Run;
**You can use both Where and If to subset the dataset. ‘Where’ is more efficient when operating on larger data as it directly subset
from raw data. ‘If’ will get all the data and then filter. However, ‘If’ can tell you how many variables are in the original dataset
vs. current data set to give you a sense of how many have been dropped;
Data Test3;
Set Test;
Where Gender = ‘Male’;
Run;

Data Enrol;
Set Profile;
If Enrol = ‘Yes’;
Run;

/*accessing other libraries*/
Data Cars;
Set SASHELP.Cars;
Run;

Data JanJol;
Set SASHELP.Holiday;
If month = 1;
Run;

/*accessing your own sas datasets*/
Libname Proj1 “C:\Users\shirui.zhang\Desktop\GfK Learning\SAS-test\SASUniversityEdition\myfolders”;
Libname Proj2 “C:\Users\shirui.zhang\Desktop\GfK Learning\SAS-test\SASUniversityEdition\myfolders\DS1”;

/*copy out a dataset from library*/
Data Income;
Set Proj2.Income;
Run;
/*save dataset in work to folder*/
Data Proj1.CP951;
Set SASHELP.CP951;
Run;

/*export sas data file to excel*/
Proc Export Data = Income
OutFile = “C:\Users\shirui.zhang\Desktop\GfK Learning\SAS-test\SASUniversityEdition\myfolders\Income.csv”
Replace
Dbms = csv;
Run;
/*replace option allows you to overwrite the file if it’s already exsisted in the directory path. we should always include whit option*/
/*dbms specifies the file type for exporting*/
Proc Export Data = SASHelp.Electric
OutFile = “C:\Users\shirui.zhang\Desktop\GfK Learning\SAS-test\SASUniversityEdition\myfolders\Electric.csv”
Replace
Dbms = csv;
Run;
/*need to check why can only export csv but not xls/xslx*/

/*exporting data to text file*/
Proc Export Data = Income
OutFile = “C:\Users\shirui.zhang\Desktop\GfK Learning\SAS-test\SASUniversityEdition\myfolders\Income.txt”
Replace
Dbms = tab;
Run;
/*for text, it’s xxx.txt dbms = tab*/

/*export can also be done by right clicking the data file and define the path to export*/

/*Exercise 1*/
Data Rate;
Bank = ‘Bank of America’;
Rate = 3.23;
Year = 5;
Run;
/*every var need to end with a semicolon*/

/*Exercise 2*/
Data Demo;
Input ID $ Race $ Gender $ Age;
Datalines;
APP001 Black Male 30
APP002 Caucasian Female 25
APP003 Asian Male 24
APP004 Black Female 32
APP005 Caucasian Female 26
;
Run;
/*Semicolon is needed after Dataline statement*/

/*Exercise 3*/
Data Perch;
Set SASHelp.Fish;
If Species = ‘Perch’;
Run;

/*Coding Project 1*/
/*data import*/
Proc Import datafile = ‘C:\Users\shirui.zhang\Desktop\GfK Learning\SAS-test\SASUniversityEdition\myfolders\kpi.txt’
out = kpi
dbms = tab
replace;
delimiter = ‘ ‘;
Run;
/*another way of importing with datalines*/
Data kpi2;
Input Store $ Revenue Staff Salary Operation Profit Complaint Turnover;
Datalines;
STORE101 128000 18 29200 15200 83600 5 2
STORE102 158000 17 19000 12000 127000 11 2
STORE103 138000 18 26300 10500 101200 7 1
STORE104 101000 17 19700 19700 61600 5 2
STORE105 123000 15 29500 10400 83100 7 1
STORE106 189000 13 24400 12600 152000 5 2
STORE107 135000 10 24800 11900 98300 5 2
STORE108 130000 14 19400 11000 99600 3 1
STORE109 191000 12 28300 10500 152200 8 2
STORE110 176000 10 23500 15900 136600 9 1
;
Run;
/*subset*/
Data Gkpi;
Set Kpi;
If Revenue / Staff > 10000;
Run;

/*math operations*/
Data Numbers;
Input Val1 Val2;
Datalines;
1 26
4 49
17 45
13 32
11 22
34 43
25 25
0 43
19 23
23 33
;
Run;

Data Math1;
Set Numbers;
Val3 = Val1 + Val2;
Val4 = Val1 – Val2;
Run;

Data Math2;
Set Numbers;
Val3 = Val1 * Val2;
Val4 = Val1 / Val2;
Run;
/*ctrl + f4 to close the active window*/

Data Accounting;
Input EmployeeID $ Base OTHours Rate;
Datalines;
ID0001 6300 4 40
ID0002 4500 8 35
ID0003 8800 5 45
ID0004 8900 4 60
ID0005 5600 3 55
ID0006 6000 9 45
ID0007 7600 2 35
;
Run;

Data Salary;
Set Accounting;
Salary = Base + OTHours * Rate;
Run;
/*Sum function in SAS instead of +*/
Data Numbers;
Input Val0 Val1 Val2;
Datalines;
4 1 26
1 4 49
9 . 45
23 13 .
48 11 22
. 34 43
19 25 25
7 0 .
17 10 23
16 23 33
;
Run;

Data Math1;
Set Numbers;
Val3 = Val0 + Val1 + Val2;
Run;

/*When using +, as long as one of the vars has na, the summed value is na.*/
/*The plus sign fails to add up numbers with the presence of missing value*/
Data Math2;
Set Numbers;
Val3 = Sum (of Val0, Val1, Val2);
Run;
/*The sum function will ignore any missing value when adding up the numbers*/

Data Sales;
Input StoreID $ Mon Tue Wed Thr Fri;
Datalines;
STR001 890 750 760 950 1150
STR002 650 450 520 760 970
STR003 1420 1250 1300 1425 1890
STR004 830 . 710 810 1040
STR005 760 550 780 980 1230
STR006 430 530 560 . 970
STR007 1010 980 . 890 1230
STR008 440 740 560 640 740
STR009 670 690 730 950 870
STR0010 570 790 820 810 640
;
Run;

Data TotalSales;
Set Sales;
TotalSales = Sum (of Mon, Tue, Wed, Thr, Fri);
Run;

/*mean, min, max with SAS*/
Data Numbers;
Input Val0 Val1 Val2;
Datalines;
4 1 26
1 4 49
9 . 45
23 13 .
48 11 22
. 34 43
19 25 25
7 0 .
17 10 23
16 23 33
;
Run;
Data Math;
Set Numbers;
m1 = Mean(val0, val1, val2);
m2 = Min(val0, val1, val2);
m3 = Max(val0, val1, val2);
Run;

Data Exam;
Input ID $ Test1 Test2 Final;
Datalines;
9000001 90 87 66
9000002 66 89 98
9000003 20 45 73
9000004 75 74 100
9000005 44 85 76
9000006 71 82 89
9000007 82 83 88
9000008 68 71 54
9000009 36 48 46
9000010 71 73 98
;
Run;

Data FinalScore;
Set Exam;
FinalScore = 0.5 * max(Test1, Test2) + 0.5 * Final;
Run;

/*Rand function*/
Data Template;
do i = 1 to 10;
output;
end;
Run;
/*create random number*/
/*uniform distribution – every data point along the line has equal probability of being selected*/
Data Rand1;
Set Template;
Randno = Rand(‘uniform’);
Run;
/*normal distribution – second and third parameters are mean and st*/
Data Rand2;
Set Template;
Randno = Rand(‘normal’, 5, 1.2);
Run;
/*Bernoulli Distribution – The second parameter is the probability*/
Data Rand3;
Set Template;
Randno = RAND(‘bernoulli’, 0.7);
Run;
/*Poisson distribution – second parameter is mean*/
Data Rand6;
Set Template;
Randno = RAND(‘poisson’, 2);
Run;
Data Stat;
Input Player $ Stat $ Current;
Datalines;
L.James Points 28.6
L.James Rebounds 10.3
L.James Assists 6.4
L.James Blocks 0.8
L.James FG 47.7
;
Run;
/*generate random number within the range of 0.9-1.1 with uniform distribution*/
Data Template2;
do i = 1 to 5;
output;
end;
Run;
/*set the boundries of the output values in 2nd and 3rd parameters*/
/*apply the radom number on the current stats to get predicted value*/
Data PredStat;
Set Stat;
PredSat = Current * RAND(‘uniform’, 0.9, 1.1);
Run;

/*round function*/
Data Numbers;
input randno;
datalines;
7.820995511
5.6069808411
2.6857884419
8.2832695455
4.8186799493
3.1513387109
11.895538688
6.9183423654
5.9589683432
7.7071493822
;
Run;
/*round numbers the second parameter is the multiple of number of decimal places*/
Data Numbers2;
Set Numbers;
Randno2 = round(Randno, 0.2);
Run;
/*e.g. if just one decimal place then 0.1*/
Data Numbers2;
Set Numbers;
Randno2 = round(Randno, 0.1);
Run;
/*e.g. if just round in 100s then then 1000*/
Data Numbers1;
Input Revenue;
Datalines;
1000320
2030002300
3253232
532150212
234123342
;
Run;
Data Numbers2;
Set Numbers1;
Randno2 = round(Revenue, 1000);
Run;

Data Grocery;
Input Item Price Discount;
Datalines;
1 60.49 10
2 79.99 13
3 47.99 7
4 50.09 0
5 43.99 5
6 81.79 7
7 68.17 3
8 103.49 10
9 121.78 8
10 78.69 8
;
Run;
Data Grocery2;
Set Grocery;
DisPrice = round(Price * (1-Discount/100), 0.01);
Run;
/*Floor function – round down numeric values*/
Data Numbers;
input randno;
datalines;
7.820995511
5.6069808411
2.6857884419
8.2832695455
4.8186799493
3.1513387109
11.895538688
6.9183423654
5.9589683432
7.7071493822
;
Run;
/*round down the values to whole numbers*/
Data Numbers2;
Set Numbers;
Randon2 = Floor(Randno);
Run;
/*round down the values to 1 decimal place – cannot do directly with floor. needs some math tweak*/
Data Numbers2;
Set Numbers;
Randon2 = Floor(Randno*10)/10;
Run;

/*Substr function – get substrings from strings or characters*/
Data PhoneList;
Input Tel $20.;
datalines;
416-222-2222
905-222-333365
289-222-4444
604-333-2121
780-444-8989
514-555-59592
902-333-1414
204-999-4646
877-111-1212
800-222-8686
;
Run;
/*The second parameter tells SAS where substring starts. If end is not specified, it means get all the way to the end*/
Data PhoneList2;
Set PhoneList;
Ext1 = substr(Tel, 9);
Run;
/*If there is a third parameter, it specifies the number of characters to extract*/
Data PhoneLisr3;
Set PhoneList;
Ext2 = substr(Tel, 9, 2);
Run;
/*get the first 3 characters – starting position is 1 and no. is 3*/
Data PhoneLisr4;
Set PhoneList;
Ext2 = substr(Tel, 1, 3);
Run;
/*Length function – Length() get the number of characters of the variables*/
Data List;
Input Name : $20.;
datalines;
Anna
Christopher
Mary
Jonathan
Tim
Elizabeth
Amy
schwarzenegger
Kevin
Porter
;
Run;
Data List2;
Set List;
Len = Length(Name);
Run;

Data FirstName;
Input Name $80.;
datalines;
Anna
Christopher
Mary
Jonathan
Tim
Elizabeth
Amy
(*$##*($(#*$(#*$(*#($*#(*$#*$#*$(*#*$#*$#*$*#(*$(#
schwarzenegger
Kevin
Porter
;
Run;
Data FirstName2;
Set FirstName;
Len = Length(Name);
Run;
/*in sas, operations are within data tables, you need to assign a name tothe function output*/

/*Double stroke – combine character variabls Var3 = Var1 || Var2;*/
Data Text;
Infile datalines dsd dlm=’,’;
Length Chr1 $10 Chr2 $50;
Input Chr1 $ Chr2 $;
Datalines;
This is a ,SAS tutorial.
My name is, Jonathan.
Today’s to,pic is Combing Character Variables.
;
Run;
/*use || to combine the texts*/
/*strings are separated by space*/
Data Text2;
Set Text;
Chr3 = Chr1 || Chr2;
Run;

Data Profile;
input SiteID $ PatientID $;
datalines;
S01 991
S01 992
S01 993
S02 991
S02 992
S02 993
S03 991
S03 992
S03 993
S04 991
;
Run;

Data Profile2;
Set Profile;
ID = SiteID || PatientID;
Run;
/*Trim function – to remove the trailing space from SiteID*/
Data Profile3;
Set Profile;
ID = Trim(SiteID) || PatientID;
Run;

Data FirstLast;
Input First $ Last $;
Datalines;
Justin Jeever
Amy Chang
Alex Toby
Andrew Khan
MaryAnn Garabed
Leslie Song
Jeff Barrett
Jeev Daha
Casey Carson
Ann Chen
;
Run;
/*apart from characters, symbols and spaces can also be combined using ||. It’s like & in excel*/
Data FirstLast2;
Set FirstLast;
Name = trim(First) ||”, “|| Last;
Run;

/*Index function – Var2=Index(Var1,”SAS”);*/
Data Example;
Text = “This is a SAS tutorial”;
Run;
/*get the starting position of the string. spaces also count*/
Data Example2;
Set Example;
Ind = Index(Text, “SAS”);
Run;
/*If the word is not found, the index function will return to the value of 0*/
Data Example3;
Set Example;
Ind = Index(Text, “training”);
Run;
/*the mathing is case sensitive. If not strictly matches, it will still return to 0*/
Data Example4;
Set Example;
Ind = Index(Text, “sas”);
Run;

Data MathExam;
infile datalines dsd;
input Comment $40.;
datalines;
Mary’s exam result is 89
Jonathan’s exam result is 77
Ann’s exam result is 97
Christopher’s exam result is 68
Peter’s exam result is 90
Jeev’s exam result is 55
Tina’s exam result is 86
Porter’s exam result is 91
MaryAnn’s exam result is 69
Tim’s exam result is 72
;
Run;
/*use index + substr to extract substrings*/
/*extract the results from string using index and substr*/
Data MathExam2;
Set MathExam;
Ind = Index(Comment, “t is”)+5;
Result = substr(Comment, Ind, 2);
Drop Ind;
Run;
/*I did not use “is” as index as the name Christeper also contains “is”*/

/*indexc function – similar to index function. searches for individual characters instead of the whole text*/
/*i.e. does the text contain any character of the string*/
/*Var2=Indexc(Var1,”!@#”);*/
/*it’s ususally used to clean the data*/
Data Name;
Input FirstName $20.;
Datalines;
Abraham
Jonathan
MaryAnn
Christopher
Benjamin
Michael
Terry###
Anderson
Paul
Porter
;
Run;
Data Name2;
Set Name;
Ind = Indexc(FirstName, “#@$%”);
Run;
/*The outcome returns to 0 or the starting position of the matching character*/

/*Find the special character as the key to position the result number*/
Data Mathexam3;
Set Mathexam;
Ind = Indexc(Comment, “‘”)+ 18;
Result = substr(Comment, Ind, 2);
Drop Ind;
Run;

/*upcase, lowcase and propcase functions*/
/*Var2=UPCASE (Var1); Var3=LOWCASE (Var1); Var4=PROPCASE(Var1);*/
/*Propcase mean cap in every first character*/
Data Example;
Comment = “This is an apple.”;
Comment2 = Upcase(Comment);
Comment3 = Lowcase(Comment);
Comment4 = Propcase(Comment);
Run;
/*this is useful when we need to index something. As index function is case sensitive*/

Data Symptoms;
Infile datalines dsd dlm=”~”;
Length Sickness $15 Symptoms $100;
Input Sickness $ Symptoms $;
Datalines;
Nausea~headache,vomiting,Dizziness,lightheadedness,Diarrhea,abdominal pain
Asthma~Coughing,shortness of breath,Chest Tightness,Wheezing
Diabetes~Weight change,Extreme FATIGUE,Unusual thirst,frequent urination
Hepatitis~fatigue,poor appetite,belly Pain,Mild fever
;
Run;
/*to find the keywords in the text chunks that have both uppper cases and lower cases, it’s better standadise the texts first*/
Data Symptoms2;
Set Symptoms;
Symptoms = Lowcase(Symptoms);
Ind = Index(Symptoms, “fatigue”);
Run;

/*rename, keep and drop statements*/
/*RENAME NAM1 = NAM2; KEEP Var1 Var2 Var3; DROP Var4 Var5 Var6;*/
/*Rename refers to rename variables*/
Data Profile;
Input First_Name $ Last_Name $ Participants_age;
Datalines;
Porter Williams 28
Amy Battison 43
John Lim 22
Casey Walson 27
Black Chen 42
Dominic Abraham 24
Mary Katy 18
Tina Kuo 59
Timothy Latino 45
Nigel Benjamin 34
;
Run;

Data Profile2;
Set Profile;
Rename First_Name = First
Last_Name = Last
Participants_age = Age;
Run;
/*rename is one statement no matter how many vars are renamed – don’t add separate ; */
Data Profile3;
Set Profile2;
Keep First Age;
Run;
Data Profile4;
Set Profile2;
Drop Last;
Run;
/*Keep and drop can also be added at set stage which can imcrease the efficiency and save space when operating on big data*/
Data Profile3;
Set Profile2 (Keep = First Age);
Run;
Data Profile4;
Set Profile2 (Drop = First Age);
Run;
Data Margarin;
Set SASHELP.Margarin;
Keep HouseID Choice Brand;
Rename HouseID = ID;
Run;
/*If else statements*/
/*IF Var1=1 Then Var2=2; ELSE Var2=4;*/
Data Exam;
Input StudentNo $ MidTerm Final;
Datalines;
STU001 56 98
STU002 34 76
STU003 89 93
STU004 66 77
STU005 74 47
STU006 63 79
STU007 80 82
STU008 29 74
STU009 69 70
STU010 92 99
;
Run;
/*If and Else are separate statements*/
Data Exam2;
Set Exam;
If MidTerm > 50 and Final > 50 Then Result = “Pass”;
Else Result = “Fail”;
Run;

/*Do statement – it allows you to perform multiple actions within a if-then statement*/
/*put the actions between do and end*/
Data Exam3;
Set Exam;
If Final > 50 then do;
Result = “Pass”;
Mark = 0.5*MidTerm + 0.5*Final;
End;
Run;
/*Do statement starts with a do and you can then execute as many as actions as you want within the do statement. */
/*Finally you close the statement with an end statement*/
Data Cars;
Set SASHELP.Cars;
If MSRP <40000 Then Class = “Economy”;
Else Class = “Luxury”;
Run;

/*Coding exersice*/
/*Exersice 1 – Calculate Phone Bills with consideration of extra costs*/
Data Phonebill;
Input BillNo $ Minutes Messages Data;
Datalines;
B001 238 15 670
B002 323 39 1300
B003 89 2 230
B004 124 22 890
B005 156 34 1290
B006 122 22 1180
B007 211 26 450
B008 189 28 990
B009 123 23 220
B010 101 7 330
;
Run;
/*Phonebill exercise – you need to think through the problem before writing the code.*/
/*e.g. how to caputure the extra minutes/message */
/*My answer*/
Data Phonebill2;
Set Phonebill;
If Minutes > 200 Then ExtraMin = Minutes – 200; Else ExtraMin = 0;
If Messages > 25 Then ExtraMsg = Messages – 25; Else ExtraMsg = 0;
If Data > 1000 Then ExtraData = Data – 1000; Else ExtraData = 0;
If Minutes <= 200 and Messages <= 25 and Data <= 1000 Then Bill = 30;
Else Bill = 30 + 0.1*ExtraMin+ 0.5*ExtraMsg+ 0.05*ExtraData;
Run;
/*Model answer*/
/*It uses a Max function to avoid using the if else statement. Bascially similiar approach*/
/*A good way of using Max is to use a variable against a benchmark / baseline / 0*/
Data PhoneBill3;
Set PhoneBill;
Base = 30;
Add_Min = Max(Minutes-200, 0);
Add_Meg = Max(Messages-25, 0);
Add_Dat = Max(Data-1000, 0);
Amount = Base + Add_Min*0.1 + Add_Meg*0.5 + Add_Dat*0.05;
Run;
/*Exersice 2 – Get average*/
Data Race;
Input Car $ Lap1 Lap2 Lap3 Lap4 Lap5;
Datalines;
CAR001 390.31 410.9 460.8 380.56 470.27
CAR002 410.19 440.99 390.61 400.14 430.62
CAR003 420.86 410.68 400.61 390.29 460.16
CAR004 390.65 380.87 430.19 380.82 470.54
CAR005 420.62 430.82 430.15 400.95 430.33
;
Run;
Data Race2;
Set Race;
Avg = mean(Lap1, Lap2, Lap3, Lap4, Lap5);
Run;
/*Exersice 3 – Generate random index with uniform distribution and get only index 4 claims*/
Data Expense;
do i = 101 to 200;
Claim = “CLAIM” || put(i, 3.0);
output;
end;
drop i;
Run;
/*My answer uses round(), rand() and if*/
Data Expense2;
Set Expense;
Index = Round(Rand(‘uniform’, 1, 10));
If Index = 5;
Run;
/*Model answer uses floor(), rand() and if*/
Data Expense3;
Set Expense;
RandInt = Floor(Rand(‘uniform’)*10) + 1;
If RandInt = 5;
Run;
/*Rand(‘uniform’) itself is 0-1 when the boundry is not set*/

/*Exersice 4 – Get the age at marridge*/
Data Date;
Input ID $ BirthDate MarriageDate;
Format BirthDate MarriageDate yymmdd10.;
Datalines;
ID001 4423 19215
ID002 3234 19216
ID003 3242 19215
ID004 5124 19217
ID005 4921 19216
ID006 2772 19215
ID007 4212 19216
ID008 3821 19215
ID009 2412 19216
ID010 4124 19217
;
Run;
Data Date2;
Set Date;
Age = Floor((MarriageDate – BirthDate)/365.25);
Run;
/*We need to use Floor to round down in this case, as unless the birth day is past, we consider the person still at a smaller age.*/

/*Exersice 5 – Separate First and Last names*/
Data Name;
Infile Datalines dlm=”,”;
Length Name $30;
Input Name $;
Datalines;
Nancie Strohl
Janee Salvaggio
Annabelle Schwanke
Marybelle Holtsclaw
Art Dahlen
Barrie Bianco
Barbera Deltoro
Launa Gaudreau
Isabelle Tiernan
Annamae Bellin
;
Run;

/*My answer: Index the space and substr names*/
Data Name2;
Set Name;
Ind = Indexc(Name, ” “);
FirstName = substr(Name, 1, Ind-1);
LastName = substr(Name, Ind+1);
Run;

/*Exersice 6 – Classify the products based on the first digit*/
Data Product;
Input Barcode $16.;
Datalines;
584563449723
584563358735
384563675036
084563154684
084563447166
984563804834
284563674476
884563657341
784563480621
384563604852
;
run;
/*Need to be careful about indent for datalines, the spaces might be counted as length*/
/*My answer – If-then function, substr function*/
Data Product2;
Set Product;
CatInd = substr(Barcode, 1, 1);
If CatInd = 2 Then Cat = “Meats, Fruits, Vegetables”;
If CatInd = 3 Then Cat = “Drug”;
If CatInd = 4 Then Cat = “Loyalty Card”;
If CatInd in (5, 9) Then Cat = “Coupon”;
If CatInd in (0, 1, 6, 7, 8) Then Cat = “General”;
Run;
/*CatInd in (1, 2, 4, 5, 6) can be used, same as in SQL*/

/*Model answer – if d1 = 5 or d1 = 9 is the same as if d1 in (5, 9)*/
Data Product3;
Set Product;
Length class $30;
d1 = substr(barcode, 1, 1);
if d1 = 2 then class = “Meat, Fruits, Vegetables”;
else if d1 = 3 then class = “Drug”;
else if d1 = 4 then class = “Loyalty Card”;
else if d1 = 5 or d1 = 9 then class = “Coupon”;
else if d1 in (0, 1, 6, 7, 8) then class = “General”;
Run;

/*Exersice 7-8 extract other info from barcode*/
Data Product4;
Set Product;
Manufacture = substr(barcode, 2, 5);
Run;

Data Product5;
Set Product;
Product = substr(barcode, 7, 5);
If Product = 67503;
Run;
/*Exersice 9 check whether the last digit (check digit) is correct*/
Data Product6;
Set Product;
/*Last Digit*/
d12 = substr(barcode, 12, 1);
/*Check Digit Calculations*/
d1=substr(barcode, 1, 1);
d2=substr(barcode, 2, 1);
d3=substr(barcode, 3, 1);
d4=substr(barcode, 4, 1);
d5=substr(barcode, 5, 1);
d6=substr(barcode, 6, 1);
d7=substr(barcode, 7, 1);
d8=substr(barcode, 8, 1);
d9=substr(barcode, 9, 1);
d10=substr(barcode, 10, 1);
d11=substr(barcode, 11, 1);
/*Calculation*/
odd = sum(of d1, d3, d5, d7, d9, d11)*3;
even = sum(of d2, d4, d6, d8, d10);
check = 10 – mod(odd+even, 10);
/*Check if Last Digit is the same as Check Digit*/
if d12 ^= check then flag = 1;
Run;
/*can I use the do / loop logic for it? i*2-1 for odd and i*2 for even*/
/*Yes, you can use do statement or array*/

/*Code Project for Module 3*/
/*use %include to run a file of long codes – this can be used to run stored procedures*/
%include ‘C:\Users\xxxLearning\SAS-test\SASUniversityEdition\myfolders\coding_exercise_(code).txt’;
/*Perform data checks on the dataset*/
/*Instead of print distinct values to check, use some calculations and assumptions to validate*/
/*Numeric errors: Abnormal age, income and spend*/
Data CkAge;
Set Customer;
Age = (DOS – DOB) / 365.25;
If Age > 140 Or Age < 0;
Comment = “Abnormal Age Value ” || Compress(Age) ||””;
Variable = ‘Age’;
Value = put(Age, best.);
Run;
/*make reasonable assumptions based*/
Data CkIncome;
Set Customer;
If Income > 500000 Or Income < 0;
Comment = “Negative Income Value ” || Compress(Income) ||””;
Variable = ‘Income’;
Value = put(Income, best.);
Run;
/*Wrong spend is based on the assumption of “it should not be more than 3 times of income”*/
Data CkSpend;
Set Customer;
If Spend > 3*Income;
Comment = “Abnormal Income Value (” || Compress(Spend) ||”)”;
Variable = ‘Spend’;
Value = put(Spend, best.);
Run;
/*CHAR errors: special characters in name, wrong custid formats, wrong occup input*/
/*Use indexc to inspect special characters or even characters*/
Data CkCustID;
Set Customer;
Lg = Length(CustID);
Ind = Indexc(CustID, “abcdefghijklmnopqrstuvwxyz,.!/@#$%^&*()_+?}-{<“);
If Lg ^= 8 Or Ind ^=0;
Comment = “Special Char in CustID (” || Compress(CustID) ||”)”;
Variable = ‘CustID’;
Value = CustID;
Run;

/*check whether the name strings contain special characters – input errors*/
Data CkNames;
Set Customer;
Ind1 = Indexc(First, “!@#$%^&*()<>?{}:1234567890”);
Ind2 = Indexc(Last, “!@#$%^&*()<>?{}:1234567890”);
If Ind1 ^= 0 Or Ind2 ^=0;
Comment = “Special Char in Last Name (” || Compress(Last) ||”)”;
Variable = ‘Last’;
Value = Last;
Run;
/*- or / is allowed in occup*/
Data CkOccup;
Set Customer;
Ind = Indexc(Occup, “!@#$%^&*()<>?{}:1234567890”);
If Ind ^= 0;
Comment = “The Occup (” || compress(Occup) || “) contains invalid character(s).”;
Variable = ‘Occup’;
Value = Occup;
Run;
/*check categorical variables*/
Data CkGender;
Set Customer;
If Gender ^= ‘Female’ and Gender ^= ‘Male’;
Comment = “Unknown Gender (” || compress(Gender) || “)”;
Variable = ‘Gender’;
Value = Gender;
Run;
Data CkEdu;
Set Customer;
If Edu not in (1, 2, 3, 4);
Comment = “Undefined Edu Level (” || compress(Edu) || “)”;
Variable = ‘Edu’;
Value = put(Edu, best.);
Run;
/*although Edu has been recoded in Texts in format statement, it still should be indexed with raw values*/
Data CkStatus;
Set Customer;
If Status not in (‘Married’, ‘Single’, ‘Divorced’);
Comment = “Unkown Marridge Status (” || compress(Status) || “)”;
Variable = ‘Status’;
Value = Status;
Run;
/*Need to check the trauncating issue – use inform*/
Data ErrorReport (Keep = CustID Variable Value Comment);
Retain CustID Variable Value Comment;
Length Variable $30 Value $30 Comment $500;
Set CkAge CkIncome CkSpend CkCustID CkNames CkOccup CkGender CkEdu CkStatus;
Run;
/*Use retain statement to re-order the columns. Retain needs to be before Set*/
Proc Export Data = ErrorReport
OutFile = ‘C:\Users\shirui.zhang\Desktop\GfK Learning\SAS-test\SASUniversityEdition\myfolders\Error Report.csv’
Replace
Dbms = csv;
Run;

**Somethings to take note of when appending:*;
/*The values are in different types by default as some are numeric and some are characters*/
/*Thus we need to transfrom numebers in numeric forms into char by using put, or else they can’t be appended*/
/*Another thing is that if length is not pre-defined, the char length of the variables will take the length of the first one*/
/*Thus the latter ones could be truncated. We need to define the length before Set Statement*/
/*Although Edu have been recoded in texts, the error ones are not coded, so they are still in numeric forms. Their data types need to be changed*/
/*Own test – find a word that starts with t, ends with t and has a t in the middle*/

/*Model Answer*/
**The model answer does everything in one shot – it uses if then do – end logic to select and combine all the errors within */
one data step**;

Proc Import datafile = ‘C:\Users\shirui.zhang\Desktop\GfK Learning\SAS-test\SASUniversityEdition\myfolders\word.txt’
out = word
dbms = tab
replace;
delimiter = ” “;
Run;
Data Word;
Set word;
Rename _2 = Words;
Run;
/*create vars and check*/
Data Word2;
Set Word;
Lg = Length(Words);
FL = substr(Words, 1, 1);
LL = substr(Words, Lg, 1);
Run;
/*Filter and select*/
Data Word3;
Set Word2;
Mid = substr(Words, 2, Lg-2);
If Lowcase(FL) = ‘t’ and Lowcase(LL) = ‘t’ and Index(Lowcase(Mid), ‘t’)^=0;
Run;
/*In the end, the answer is Teapot, lol*/

**6 attributes of SAS variable*;
**-name -lable -length -type -format -informat*;
*It is estimated that 30% of the programming error are related to incorrect variable attributes.;
Data Fish;
Set SASHelp.Fish;
Run;
**Specis has name: Species, label: n.a. type: text length: 9 format and informat: n.a.;
/*We can go properties of the dataset, go to columns and click in the variable to check its attributes*/

/*Another way to review is via proc contents*/
/*Proc Contents to view variable attributes*/
Proc Contents Data = SASHelp.Fish;
Run;
/*In Alphabetic List of Variables and Attributes table, you will be able to find the variable attributes info*/
Data Cars;
Set SASHelp.cars;
If Enginesize > 4;
Run;
**Variable naming rule*;
/*no spaces, no special symbols except for “_”, no longer than 32 chars, no starting with numbers */

**Variable labels are optional and it describes the variable*;
Data Cars;
Set SASHelp.cars;
Label EngineSize = “The size of the engine”;
Run;
/*by using this label statement, the label of the variable will be added/updated*/
/*it’s usually used as addtional information for third party – can use up to 256 chars*/

Data Admission;
Input Applicant $ Crit1 $ Crit2 $ Crit3 $;
Datalines;
P00851 Met Met Met
P00852 Met NotMet Met
P00853 Met Met Met
P00854 NotMet NotMet NotMet
P00855 Met Met NotMet
P00856 Met Met Met
P00857 NotMet Met Met
P00858 Met NotMet Met
P00859 Met Met Met
P00860 Met Met Met
;
Run;

Data Admission;
Set Admission;
Label Crit1 = “Applicant’s GPA above 3.5”
Crit2 = “Applicant’s did not fail more than 4 courses”
Crit3 = “Applicant’s has a Bachelor of Science degree”
;
Run;
/*Only one label statement is needed for all the labels added*/
Data Number;
Input Age Amount Date;
Format Date yymmdd10.;
Datalines;
41 3409 18478
37 2036 18946
48 3763 20288
58 2815 19732
59 4891 20616
26 2185 20687
46 3362 20514
64 4653 19907
60 3878 20384
61 3307 20386
;
Run;

Proc Contents Data = Number;
Run;

/*Missing value*/
/*Missing value can be put as a dot in datalines*/
Data Number2;
Input Age Amount Date;
Format Date yymmdd10.;
Datalines;
41 3409 18478
37 2036 18946
. 3763 20288
58 2815 19732
59 4891 20616
26 2185 20687
46 . 20514
64 4653 19907
60 3878 20384
61 3307 20386
;
Run;
/*When we check the data types, we can see that money amount with dollar sign, date, missing value can all be kept under numeric variable*/

Data Char;
Infile Datalines DLM=”,”;
Length Name $8 ID $8 Comment $13 Descrip $10 Age $5;
Input Name $ ID $ Comment $ Descrip $ Age $;
Datalines;
John,ID001,This is John,Male,36
Mary,ID002,This is Mary,Female,19
Anton,ID003,This is Anton,Male,28
Chris,ID004,This is Chris,Female,48
;
Run;
/*In this example, the age is formatted in char. Char data type can capture both numbers and texts*/
/*However, it’s a good programming practice to keep the numeric values under numeric variabl.*/
**Data truncation is a very common problem with character variable. Thus we need to constantly check the df created in case of errors;

**The length of a variable is the number of bytes assigned to the variable regardless numeric or char vars;
**3 bytes of numbers is not 3 diguts, the largest integer 3 bytes can caputure is 8,192, largest number 8 bytes can capture is 9,007,199,254,740,990 which is big enought;
**A numeric variable cannot have a length greater than 8;
**while it’s a different thing for char vars, length 8 can only capture 8 characters;
Data Char;
Input Name $ Gender $;
Datalines;
Christopher M
Elizabeth F
Kanjirathinkal M
;
Run;
/*As shown in the example, data truncation takes place a lot for char vars. Thus we need to use length statement to assign a new length for them*/
Data Char2;
Length Name $15;
Input Name $ Gender $;
Datalines;
Christopher M
Elizabeth F
Kanjirathinkal M
;
Run;
/*Length should be before the Input statement*/
Data PhoneBook;
Length PhoneNo $15;
Input Name $ PhoneNo $ Gender $;
Datalines;
Jason 123-456-7890 Male
Benjamin 987-654-3210 Female
;
Run;

/*Data Format*/
**Variable formats change the display of the data. It masks the data so that it’s different than the value stored in the system;
Data Display;
Input Num1 Num2;
Datalines;
10001 10001
10002 10002
10003 10003
10004 10004
10005 10005
10006 10006
10007 10007
10008 10008
10009 10009
10010 10010
;
Run;
/*In this program, we set the format of Num2 variable to dollar 10. This format means that the number now includes a dolalr sign and the comma separator*/
Data Display2;
Set Display;
Format Num2 Dollar10.;
Run;
/*In it’s dataset properties, the format will also be displayed as dollar10.*/
**Variable format only changes the display of the data, not the internal data stored in the system;
/*i.e. in the example above, Num1 and Num2 are the extact same data*/
/*Numeric data can also be displayed as date/time format*/
Data Display3;
Set Display;
Format Num2 yymmdd10.;
Run;
Data Display4;
Set Display;
Format Num2 time8.;
Run;

Data Transac;
Input Order $ Date Time Total;
Datalines;
BA00001 19000 40143 1432
BA00002 19000 51865 1455
BA00003 19000 68954 2435
BA00004 19000 49865 894
BA00005 19000 53214 1745
BA00006 19000 64521 997
BA00007 19000 74521 562
BA00008 19000 56321 132
BA00009 19000 51236 987
BA00010 19000 45698 562
;
Run;
/*In this example, format is used to get the proper display of data*/
Data Transac2;
Set Transac;
Format Date yymmdd10. Time time8. Total dollar10.;
Run;
/*The differences between figures stored in numeric vs. char format*/
Data DollarAmt;
Input D1 D2 $;
Format D1 dollar9.0;
Datalines;
10000 $10,000
20000 $20,000
30000 $30,000
40000 $40,000
50000 $50,000
60000 $60,000
70000 $70,000
80000 $80,000
90000 $90,000
100000 $100,000
;
Run;
/*In this example, D1 is in dolalr10. type in numeric format while D2 is in char format. They look the same but only D1 allows numeric data handling and summary.*/
/*Thus when needed, you should always use numeric var with a relevant format as opposed to using a char value. You can always change the format of display for better viewing of info*/

/*The format of W.D.*/
/*W defines the digits for display the full number including the digits*/
/*D defines the number of decimal places*/

Data Num;
Input Num1 Num2;
Datalines;
12345.67 12345.67
34561.11 34561.11
21165.35 21165.35
13216.31 13216.31
61343.11 61343.11
15654.61 15654.61
21313.24 21313.24
44131.31 44131.31
94543.13 94543.13
48634.31 48634.31
;
Run;
Data Num2;
Set Num;
Format Num2 10.4;
Run;
/*10 is the number of digits of the data string (including the decimal dot itself*/
/*One thing must be beard in mind is that no matter how the format changes, the data does not change*/
Data RNumber;
Input Num1;
Datalines;
23483.8492342
432.2348934
5489345.23423
432.53242
98349.2343
45893495.2343
45489.3454
6845984.234
58498349.543
25484.23545
;
Run;
Data RNumber1;
Set RNumber;
Format Num1 13.4;
Run;
/*This is how you put display numbers with comma*/
Data Num;
Input Num1 Num2;
Datalines;
12345.67 12345.67
34561.11 34561.11
21165.35 21165.35
13216.31 13216.31
61343.11 61343.11
15654.61 15654.61
21313.24 21313.24
44131.31 44131.31
94543.13 94543.13
48634.31 48634.31
;
Run;
Data Num2;
Set Num;
Format Num2 Comma.;
Run;
/*However, after quickly checking the output, we can see that the decimals are gone*/
/*Here is a way to keep / re-define the decimals */
Data WD2;
Set Num;
Format Num2 Comma9.2;
Run;
/*It’s the same usage as format w.d*/
/*You can also use dollar w.d. to add dollar sign and decimal places*/
Data Transac;
Input Order $ Date Time Total;
Datalines;
BA00001 19000 40143 1432
BA00002 19000 51865 1455
BA00003 19000 68954 2435
BA00004 19000 49865 894
BA00005 19000 53214 1745
BA00006 19000 64521 997
BA00007 19000 74521 562
BA00008 19000 56321 132
BA00009 19000 51236 987
BA00010 19000 45698 562
;
Run;
Data Transac2;
Set Transac;
Format Total Dollar5. Total_After_tax Dollar8.2;
Total_After_tax = Total * (1-0.1475);
Run;
**When adding dollar sign and comman, you need to count in the digits for dollar sign ($) and comma (,). If the digits are not counted in
dollar sign or last decinal numbers will be removed;

/*SAS Dates*/
**SAS date starts at January 1, 1960. It’s represented by integer 0.
**Jan 2, 1960 is represented by 1
-1000 = April 6, 1957;
**SAS uses this system – number indexing to keep track of dates, instead of 01JAN2018 or 2018-01-01;

/*SAS Time*/
**Each integer represents 1 second
0 = 0:00:00
360 = 0:06:00
3600 = 1:00:00;

/*Here is an example of how to conver the dates in number forms to be displayed in date format*/
Data DFormat;
Input Date;
Datalines;
-130001
10002
10003
21500
40005
50006
10007
10008
10009
90010
;
Run;
Data DFormat2;
Set DFormat;
Format Date YYMMDD10.;
Run;
**DDMMYY10. is a very common date display type;
Data DenAppt;
Input Date Time NumDays;
Format Date yymmdd10. time time5.;
Datalines;
20100 30000 30
20110 40000 60
20115 46000 14
20103 36000 90
20120 42000 60
20117 44000 30
20111 38000 14
20109 37000 21
20102 33000 28
;
Run;
Data DenAppt2;
Set DenAppt;
Format NextAppt yymmdd10.;
NextAppt = Date + NumDays;
Run;
/*The format of NextAppt needs to be defined to be displayed in the same way as Date*/
**Some other date formates used in SAS: yymmdd10. Date9. Date7. mmddyy10.;
/*Here is an example to view these date formats*/
Data DFmt;
Input Date1 Date2 Date3 Date4;
Datalines;
12001 12001 12001 12001
12002 12002 12002 12002
12003 12003 12003 12003
12004 12004 12004 12004
12005 12005 12005 12005
12006 12006 12006 12006
12007 12007 12007 12007
12008 12008 12008 12008
12009 12009 12009 12009
12010 12010 12010 12010
;
Run;
Data DFmt2;
Set DFmt;
Format Date1 yymmdd10. Date2 Date9. Date3 Date7. Date4 mmddyy10.;
Run;
**yymmdd10. = 1992-11-09 mmddyy10. = 11/9/1992 Date9. = 09NOV1992 Date7. = 09NOV92;
/*The yymmdd10. and mmddyy10. has – and / respectively while Date7. and Date9. have no separator but use three letters as months*/

**The two common time formats are Time5. and Time8.;
**Time5. = 12:56 Time8. = 12:56:28;
Data TFmt;
Input Time1 Time2;
Datalines;
36001 36051
39001 39001
42001 42001
45001 45001
48001 48001
51001 51051
54001 54001
57001 57001
60001 60001
63001 63001
;
Run;
Data TFmt2;
Set TFmt;
Format Time1 Time5. Time2 Time8.;
Run;
/*The Time8. won’t be the rounded Time5., it’s simply Time8. without seconds i.e. Time8. 10:50:59 = Time5. 10:50 not 10:51*/
Data Attendance;
Input EmpID $ Date In Out;
Datalines;
ID501 20010 28800 61200
ID502 20010 30400 63200
ID503 20010 29800 61800
ID504 20010 33600 61700
ID505 20010 27800 60200
ID506 20010 29400 61900
ID507 20010 29300 61850
ID508 20010 31600 62930
ID509 20010 32400 62300
ID510 20010 31200 61590
;
Run;
Data Attendance2;
Set Attendance;
Format Date yymmdd10. In Time8. Out Time8. Late $7.;
If In > 32400 Then Do;
Late = “Late”;
LateBy = (In – 32400)/60;
End;
Else Do;
Late = “On Time”;
LateBy = 0;
End;
Run;
/*To set a time condition for variable, the time/date should still be in system format i.e. 10000/32400 etc. The way to calculate it is
to use time – midnight time (00:00:00) and convert to seconds*/
/*If else condition have separate statemetns each ends with “;” thus it’s if xxx; else yyy;
to have multiple conditions in one if else logic, use if xxx then do; yyy zzz; end; */
/*For if then do end logic, there is “;” after then do and each assignment/condition is separated by “;”*/

/*Create your own format – Proc Format*/
/*Sometimes we want to change the way how data is displayed. E.g. in the example below, survey outcomes are shown but we don’t know what they mean*/
/*Thus we update their format for a clearer idea*/
Data Survey;
Input Q1 Q2 Q3 Q4;
Datalines;
3 1 2 4 5
1 1 1 1 1
5 3 5 5 5
4 5 3 2 2
1 3 5 4 5
2 3 5 4 5
3 4 5 2 3
2 4 1 3 5
5 4 4 4 4
1 1 2 5 4
;
Run;
/*We will need to use proc format to create a format to the dataset.*/
Proc Format;
Value Ques
1 = “Strongly Disagree”
2 = “Disagree”
3 = “Neutral”
4 = “Agree”
5 = “Strongly Agree”;
Run;
/*With the proc format, format ques is created.*/
/*When applying the format, we still use format statement in data step as per normal. Just add Ques after the variables that needs to be formatted*/
/*We only need to add one Ques. after all the variables that need to be formatted*/
/*. is needed after all the formats e.g. dollar10.2, Ques. comma.*/
Data Survey2;
Set Survey;
Format Q1 Q2 Q3 Q4 Ques.;
Run;
/*Now the display shows “agree/strongly agree” instead of 4/5*/

**for categorical variables, it’s a good practice to keep them in numeric variables with a custom format
because it’s easier to do analysis and operations and it helps the variable to aviod the character trauncation issue;
Data Profile;
Length ID Gender Race $20.;
Input ID Gender $ Race $;
Datalines;
ID001 Male Caucasian
ID002 Female Asian
ID003 Female Black
ID004 Male Asian
ID005 Male Caucasian
ID006 Female Caucasian
ID007 Male Black
ID008 Female Black
ID009 Male Caucasian
ID010 Male Caucasian
;
Run;
/*You can add multiple formats in one proc format step*/
Proc Format;
Value Gender
1 = “Male”
2 = “Female”;
Value Race
1 = “Caucasian”
2 = “Asian”
3 = “Black”;
Run;
Data Profile;
Input ID $ Gender Race;
Format Gender Gender. Race Race.;
Datalines;
ID001 1 1
ID002 2 2
ID003 2 3
ID004 1 2
ID005 1 1
ID006 2 1
ID007 1 3
ID008 2 3
ID009 1 1
ID010 1 1
;
Run;
/*Previously, the Gender and Race are text. Now they are numeric with text custom format.*/
** the data types are set upon dataset creation. It can’t be alter;

/*(Best.) Format*/
**The (Best.) format displays the data as the internal value stored in SAS;
Data DateTime;
Input Date Time;
Format Date yymmdd10. time time5.;
Datalines;
18100 30540
18210 40980
18315 41230
18403 34650
18320 48970
18217 45550
18111 31100
18009 37900
18502 56400
;
Run;
/*In this example, the internal values of date and time are in numeric forms but have been formatted to be displayed as date and time formats.*/
/*Best. can be used to remove special formats from a variable*/
Data DateTime2;
Set DateTime;
Format Date Time Best.;
Run;
/*When a dataset has already been formatted, Best. can be used to convert it back to internally stored values – without any mask*/
Proc Format;
Value Ques
1 = “Strongly Disagree”
2 = “Disagree”
3 = “Neutral”
4 = “Agree”
5 = “Strongly Agree”;
Run;
Data Survey;
Input Q1 Q2 Q3 Q4;
Format Q1 Q2 Q3 Q4 Ques.;
Datalines;
3 1 2 4 5
1 1 1 1 1
5 3 5 5 5
4 5 3 2 2
1 3 5 4 5
2 3 5 4 5
3 4 5 2 3
2 4 1 3 5
5 4 4 4 4
1 1 2 5 4
;
Run;
/*Now I have a Survey dataset with clearly coded variables. However, if I want to do subset operations, it cannot be on formatted values.*/
/*Thus I need to check back the original values*/
Data Survey2;
Set Survey;
Format Q1 Q2 Q3 Q4 Best.;
Run;
/*From that I can see that if I want to choose all “Strongly Agree” on Q3, I should use 5 (by comparing Survey and Survey 2).*/
Data Survey3;
Set Survey;
If Q3 = 5;
Run;

/*Summary of variable formats*/
**It’s commonly used on numeric variables;
**W.D format allows you to control how many decimal places to have;
**Comma. Dollar. allow you to add “,” or “$”;
**Create your own format use Proc Format;
**Use Best. to get original formats;

/*Put function*/
**Converts a numeric variable to a character varaible;
Data Conv;
Input Num1 Num2;
Datalines;
1201.1234 20001
1303.1234 20002
1405.1234 20003
1507.1234 20004
1609.1234 20005
1202.1234 20004
1304.1234 20003
1406.1234 20002
1508.1234 20001
1610.1234 20000
;
Run;
Data Conv2;
Set Conv;
Char1 = put(Num1, best.);
Run:
/*In this transofmation, Char1 HAS identical information as Num1, however, it’s a char var*/
**The Put function has two parameters: the variable to be converted and the format to use for the conversion;
/*In this example, best. is used as the format for the transformation and it tells SAS to convert the values based on the internal values*/
/*You can also convert the number with another format*/
Data Conv3;
Set Conv;
Char1 = put(Num1, dollar10.2);
Run;
/*This time the data is still in char form but with dollar sign and decimal places*/
/*It can also be converted to a char var with date format*/
Data Conv4;
Set Conv;
Char1 = put(Num1, yymmdd10.);
Run;
**It’s important to be aware of the differences of formatting a variable and converting a variable using the PUT function;
**Formatting does not change the type of the varaible and put/input does;

/*Use the conversion to put numeric into char date*/
Data Trtment;
Input PATID $ Infdt TrtDt;
Format Infdt Trtdt Date9.;
Datalines;
PAT1001 20100 20102
PAT1002 20101 20102
PAT1003 20103 20102
PAT1004 20102 20102
PAT1005 20100 20102
PAT1006 20101 20102
PAT1007 20103 20102
PAT1008 20100 20102
PAT1009 20102 20102
PAT1010 20100 20102
;
Run;
Data Trtment2;
Set Trtment;
Infdt1 = put(Infdt, mmddyy10.);
Trtdt1 = put(Trtdt, mmddyy10.);
If Infdt > Trtdt
Then Do;
Comment = “Patient ” || Compress(PATID) || ” received his/her first treatment on ” || Compress(Infdt1) || “However, the treatment is received prior to the date of informed consent ” || Compress(Trtdt1) ||””;
End;
Run;
/*Compress statement is used to trim any spaces to ensure that there is no extra space. You can directly put “Patient ” PATID ” received his/her first treatment”*/

/*Informat*/
**Infomrt is mostly used when reading in an external file or converting a char var to a numeric var.;
Data Char;
Length Date $12 Time $8 Amount $15;
Input Date $ Time $ Amount $;
Datalines;
2015-01-01 8:00:00 $10,100
2015-01-02 8:05:00 $9,500
2015-01-03 8:05:00 $12,600
2015-01-04 8:10:00 $12,800
2015-01-05 8:15:00 $13,100
2015-01-06 8:05:00 $11,700
2015-01-07 8:03:00 $15,900
2015-01-08 8:02:00 $15,400
2015-01-09 8:04:00 $14,200
2015-01-10 8:05:00 $13,400
;
Run;
/*Input function*/
/*In this dataset, all variables are in char forms but they can be converted to numeric forms*/
Data Char2;
Set Char;
Date2 = Input(Date, yymmdd10.);
Run;
/*Date is Char var but it converted to numeric in Date2 variable*/
**Same as put function, the input parameters are also 1.variable 2.informat;
Data Char3;
Set Char;
Time2 = Input(Time, Time8.);
Run;
Data Char4;
Set Char;
Amount2 = Input(Amount, dollar8.);
Run;
/*Should use dollar8. to convert dollar amount*/
**Here is a list of appropriate informat;
*Pure numbers – Best.
Dollar Aount – DollarW.D
Date – yymmdd10. / mmddyy10. / date9. / date7.
Time – time8. / time5.;
*Informat used should correspond to the current char format;

/*Invalid data error*/
**When using the input funtion, you might encounter a note from the log saying “invalid argument to function input at
Date = 2015- Date2=. _ERROR_ = 1 _N_ = 8;
*This is usually due to an incomplete data point from the variable e.g. 2015- as comapred to 2015-10-11;
*It’s not an error technically, but it’s a good practice to check back in the data to see what’s the issue as it happens often;

**Format is to only change the display of the data but informat is to indeed change the data type;

/*Exersice 1*/
/*Create labels and custom formats for variables*/
Data Survey;
Input Q1 Q2 Q3;
Datalines;
1 3 5
2 4 3
1 2 1
5 5 5
4 5 3
4 3 3
2 3 1
1 5 5
2 4 5
4 3 5
;
Run;
Data Survey2;
Set Survey;
Label Q1 = “Customer service representatives (CSRs) are well supervised.”
Q2 = “Customer service representatives (CSRs) adhere to professional standards of conduct.:”
Q3 = “Customer service representatives (CSRs) act in my best interest.”
;
Run;
/*Labels can be checked in properties – Columns – double click*/
Proc format;
Value Scale
1 = “Strongly Disagree”
2 = “Somewhat Disagree”
3 = “Neither Agree nor Disagree”
4 = “Somewhat Agree”
5 = “Strongly Agree”
;
Run;
Data Survey3;
Set Survey;
Format Q1 Q2 Q3 Scale.;
Run;
**There is always a “.” after format! for custom format, we also need a “.” to tell SAS that’s it’s a format or SAS will see it
as an uninitialised variable;
/*The way to create formats and apply on dataset is to firstly create the format using proc format and then apply the format in data step*/;
/*Proc format itself is independent from the dataset*/
/*Model answer combined the label and format in the second data step after the custom formats are created*/

/*Exersice 2*/
/*Convert char vars to numeric vars and assign proper formats*/
Data Listing;
Length Adv_ID DateList NumMth Price $15;
Input Adv_ID $ DateList $ NumMth $ Price $;
Datalines;
K9933-00019 01MAR2015 3 $650,000
K9933-00020 15FEB2015 2 $380,000
K9933-00021 12APR2015 4 $890,000
K9933-00022 09MAY2015 3 $890,000
K9933-00023 22FEB2015 2 $450,000
K9933-00024 08APR2015 2 $360,000
K9933-00025 03MAR2015 5 $440,000
K9933-00026 28FEB2015 3 $860,000
K9933-00027 19JAN2015 4 $740,000
K9933-00028 01JAN2015 1 $980,000
;
Run;
/*Informat to convert*/
Data Listing2;
Set Listing;
DateList2 = input(DateList, Date9.);
NumMth2 = input(NumMth, Best2.);
Price2 = input(Price, dollar8.);
Drop DateList NumMth Price;
Run;
**Instead of re-write the variable, you can create new ones and drop old ones;
Data Listing3;
Set Listing2;
Format DateList2 Date9. NumMth2 Best2. Price2 dollar8.;
Run;
/*Need to make sure the length setting in the informat is correct or conversion might be wrong
e.g. 100000000 becomes 10000000*/
/*Convert first, format later*/

/*Exercise 3 – proc contents display info: check number of vars and obs*/
Proc contents data = SASHelp.JAC;
Run;
/*Exercise 4 – calculate fahrenheit temp and display in 1 decimal place*/
data temp;
input hour temp;
datalines;
1 19.3
2 18.1
3 19.6
4 20.7
5 22.3
6 21.4
7 19.8
8 18.1
9 16.8
10 17.1
11 17.2
12 15.6
;
run;
Data Temp2;
Set Temp;
Format TempFa 4.1;
TempFa = Temp * 9/5 +32;
Run;
/*To define the decimal of a numeric variable, simply add W.D fllowed by the var name*/

/*Data Manipulation*/
/*In general data manipulation makes up 60% of SAS program*/

/*Proc sort to sort the dataset*/
Data Drinks;
Infile Datalines dsd;
Input Type $ Name:$50. Calories Fat;
Datalines;
Cold,Caramel Frappuccino® Light Blended Beverage,100,0
Cold,Coffee Frappuccino® Blended Beverage,180,2.5
Cold,Teavana® Shaken Iced Passion Tango™ Tea,0,0
Cold,Shaken Tazo® Iced Black Tea Lemonade,100,0
Cold,Nonfat Iced Vanilla Latte,120,0
Cold,Nonfat Iced Caramel Macchiato,140,1
Cold,Iced Coffee (with classic syrup),60,0
Cold,Iced Skinny Latte,70,0
Hot,Brewed Coffee,5,0
Hot,Brewed Tea,0,0
Hot,Nonfat Green Tea Latte,210,0
Hot,Nonfat Cappuccino,60,0
Hot,Nonfat Caffè Latte,100,0
Hot,Nonfat Caramel Macchiato,140,1
Hot,Soy Chai Tea Latte,180,2
Hot,Nonfat Caffè Mocha – hold the whip,170,2
Hot,Skinny Vanilla Latte,100,0
Hot,Steamed Apple Juice,170,0
Hot,Caffè Americano,10,0
;
Run;
Proc Sort Data = Drinks;
By Name;
Run;
Proc Sort Data = Drinks;
By Calories;
Run;
Proc Sort Data = Drinks;
By Descending Calories;
Run;
/*In proc sort, the descending needs to be in front of the variable and with full spelling*/
**The DESCENDING option applies only to the variable immediately after the option.;
/*Sort by two layers – Type then Fat*/
Proc Sort Data = Drinks;
By Type Fat;
Run;

/*Creating output dataset*/
Proc Sort Data = Drinks Out = SortedDrinks;
By Calories;
Run;
/*In this program, a separate output is generated so that the raw data is not overwritten*/
/*It’s a good practise to keep the original data unchanged and update the output table.*/

/*Where statement*/
Proc Sort Data = Drinks Out = HotDrinks;
By Calories;
Where Type = “Hot”;
Run;
/*In here only ‘where’ can be used, not ‘if’ as ‘If’ can only be used in data step but not proc steps*/

/*Check what’s the most expensive cars among the five brand*/
/*First. last. varaibles */
Proc Sort Data = SASHelp.Cars Out = SortedPrice;
By Make Descending MSRP;
Where Make in (“BMW”, “Audi”, “Lexus”, “Mercedes-Benz”, “Porsche”);
Run;
Data ExpCars;
Set SortedPrice;
By Make;
If First.Make;
Run;
**First. / Last. varaibles capture the first / last observation (which usually represents extreme values) in a sorted group
The first. / last. variable involved needs to be sorted as well as the KPI variable.
In this case, the first. variable (the group) is Make and the KPI variable is MSRP
Need to think these through before writing the commands;

/*Remove duplicate observations with NODUPKEY*/
Data Supermarket;
Infile datalines dsd;
Input Product : $20. Price DemandPerWeek ;
Datalines;
Campbell Soup,3.99,150
Lay’s Chip,2.99,300
Kinder Chocolate,5.99,50
Nestle Ice cream,6.99,80
Maxwell Coffee,5.99,90
Coca cola,5.99,300
Pringles,2.99,200
Pringles,2.99,200
Lipton Milk Tea,3.99,150
Flamingo Fried Chicken,8.99,60
Dempster’s Bread,1.99,450
;
Run;
Proc Sort Data = Supermarket Out = Supermarket2 NODUPKEY;
By Product Price DemandPerWeek;
Run;
/*In this example, the duplicated entry is removed in supermarket2 and supermarket stay unchanged.*/
/*The log will tell you that 1 obs with duplicate key value is deleted*/
/*the NODUPKEY should be used with caution and you should double check the duplicate key before removing them from the dataset*/

Data Income;
Input HouseholdID $ NumMembers HomeOwner $ Income $;
Datalines;
HID1001 4 Yes >120000
HID1002 3 No <120000
HID1003 2 Yes >120000
HID1004 6 Yes >120000
HID1004 6 Yes >120000
HID1005 6 No <120000
HID1006 4 Yes >120000
HID1007 5 Yes <120000
HID1008 3 Yes >120000
HID1009 5 Yes <120000
HID1010 4 No >120000
;
Run;
/*You need to remove the duplicate key with proc sort*/
Proc Sort Data = Income Out = Income2 NODUPKEY;
By HouseholdID NumMembers HomeOwner Income;
Run;
/*It’s highly recommended to create a new data set when removing the duplication, to keep original one intact.*/

/*Flagging Extreme Values – highest and lowest using first. last.*/
Data Exam;
Input Subject $ Student $ Results;
Datalines;
Math Mary 78
Math John 67
Math Tom 98
Math Chris 56
Math Amy 89
English Mary 74
English John 79
English Tom 88
English Chris 92
English Amy 45
History Mary 32
History John 96
History Tom 55
History Chris 67
History Amy 86
;
Run;
Proc Sort Data = Exam;
By Subject Results;
Run;
Data Exam2;
Set Exam;
By Subject Results;
If first.subject then i = 1;
else if last.subject then i = 2;
Run;
/*Multiple if then conditions can be used with if then else if then else if then else*/
/*In data step, ‘by’ is equivalant to ‘group by’ in sql*/
**There is ‘by’ statement in sort as well as in the data step with first. / last.;
**The ‘by’ statement in data step is important but often neglected. Without the ‘by’ statement the flagging won’t work;

/*Data set concatenation – Append datasets with same format and variables with Set statement*/
Data List1;
Input Sku $ Product $ QTY Price;
Datalines;
S1290001 Laptop 50 979
S1290002 Laptop 30 632
S1290003 Desktop 100 1299
S1290004 Desktop 80 1599
S1290005 Laptop 10 2999
;
Run;

Data List2;
Input Sku $ Product $ QTY Price;
Datalines;
S1290006 Printer 300 229
S1290007 Printer 400 467
S1290008 Printer 100 899
S1290009 Cable 1200 8.99
S1290010 Cable 900 7.99
;
Run;
/*for datafiles with same var names from structutred database, you can append then with Set. e.g. the gse data files*/
Data List;
Set List1 List2;
Run;

Data P2;
Set SASHelp.PRDSAL2;
Run;
Data P3;
Set SASHelp.PRDSAL3;
Run;
/*Check the total number of obs and separated ones from the log*/
Data Ptotal;
Set P2 P3;
Run;

**Two common issues when concatenating data sets
1. Inconsistent variable name
2. Data Truncation;

/*Example: Concatenating the data sets with same format but different var names*/
Data Report1;
Input PatientID $ Gender $ Age;
Datalines;
P0001 M 32
P0002 F 28
P0003 M 48
P0004 F 39
P0005 M 25
;
Run;
Data Report2;
Input PID : $12. Gender $ Age;
Datalines;
P00000000001 M 32
P00000000002 F 28
P00000000003 M 48
P00000000004 F 39
P00000000005 M 25
;
Run;
Data Report;
Set Report1 Report2;
Run;
**In this example, the columns with the same var names are concastinated while the vars with same info but different names
are stored in separate rows;
**To resolve this issue, the names need to be unified. This can be done using the RENAME function;
Data Report;
Set Report1 Report2(Rename=(PID = PatientID));
Run;
/*Now we check the data, seems like the columns are apended properly.*/
/*Oh wait, there is a warning in the log saying that multiple lengths are specified and might cause truncation*/
/*We check back the data and we found it’s true that the PatientID originally from Report2 has been trauncated*/
/*The length is 8 instead of the 12 defined*/
**In Set statement that apends the data, the length of the variables are based on the first data set.;
/*Thus a Length statement needs to be used to define the correct length*/
Data Report;
Length PatientID $12.;
Set Report1 Report2 (Rename = (PID = PatientID));
Run;
**In summary, these two issues can be solved with Rename and Length statements;

data desks1;
infile datalines dsd;
length Brand $10 category $12;
input Brand $ Category $ Price Width Depth Height;
datalines;
MICKE,Desk,59.99,73,50,75
BEKANT,Desk,179.00,140,60,65
MICKE,Work station,149.00,105,50,140
VITTSJÖ,Laptop table,39.99,100,36,74
BESTÅ BURS,Desk,199.00,120,40,74
;
Run;

data desks2;
infile datalines dsd;
length Product $12 Type $28;
input Product $ Type $ Price WGT DPT HGT;
datalines;
HEMNES,Desk with add-on unit,449.00,155,65,137
MALM,Desk with pull-out panele,179.00,151,65,73
IKEA PS 2014,Standing laptop station,69.99,45,34,96
VITTSJÖ,Laptop stand,24.99,55,65,35
MICKE,Desk with integrated storage,99.00,120,50,75
;
Run;
/*You can do multiple rename statements.*/
Data Desks;
Length Brand $12 category $28;
Set desks1 desks2(Rename = (Product = Brand Type = Category WGT = Width DPT = Depth HGT = Height));
Run;

/*Merge data sets*/
Data Profile;
Input Student $ Gender $ Grade;
Datalines;
SID0001 M 7
SID0004 F 7
SID0002 F 7
SID0003 M 8
SID0007 M 6
SID0005 F 9
SID0006 F 9
SID0009 M 7
SID0008 M 7
SID0010 F 10
;
Run;

Data Exam;
Input Student $ Result;
Datalines;
SID0009 70
SID0001 78
SID0004 99
SID0002 90
SID0003 81
SID0008 81
SID0005 78
SID0006 66
SID0007 34
SID0010 66
;
Run;
**The data sets need to be sorted by merging key before they can be merged;
**Always sort before using the merge statement;
/*The three statement could not be run alone if the data sets are not sorted by the key*/
Proc Sort Data = Profile;
By Student;
Run;

Proc Sort Data = Exam;
By Student;
Run;

Data Combined;
Merge Profile Exam;
By Student;
Run;

**Two steps merging the data sets;
**Step 1: sort each of the input data sets by the merging variable(s);
**Step 2: Merge the data sets by the merging variables;

Data Cust;
Input ID Zip ;
Datalines;
9000135 14304
9000134 14301
9000138 14308
9000136 14308
9000137 14304
9000142 14304
9000139 14300
9000140 14301
9000141 14308
;
Run;

Data Purchase;
Input ID Purchase : dollar5.0;
Datalines;
9000135 720
9000133 1002
9000134 569
9000140 890
9000141 780
9000136 109
9000137 30
9000138 4566
9000139 50

;
Run;

Proc Sort Data = Cust;
By ID;
Run;
Proc Sort Data = Purchase;
By ID;
Run;

Data PurchaseRecord;
Merge Cust Purchase;
By ID;
Run;

/*Identify the merging variable and rename them if they are not in the same name*/
Data Population;
Input Country $ City: $30. Population;
Datalines;
Canada Toronto 6000000
Canada Montreal 4000000
Canada Vancouver 2400000
US Chicago 2700000
US New_York 8400000
US Los_Angeles 3800000
Mexico Mexico_City 8500000
Mexico Cancun 620000
;
Run;

Data Uber;
Input Country $ Cities: $30. NumDriver;
Datalines;
US Chicago 20000
US New_York 14000
US Los_Angeles 16000
Canada Toronto 13000
Canada Montreal 5000
Mexico Mexico_City 20000
Mexico Cancun 11000
Canada Vancouver 7000
;
Run;
Proc Sort Data = Population;
By Country City;
Run;
Proc Sort Data = Uber;
By Country Cities;
Run;
/*We need to rename the variable name to ensure the consistent names for merging*/
Data UberData;
Merge Population Uber(Rename = (Cities = City));
By Country City;
Run;
Data PopulationPerDriver;
Set UberData;
PPD = Population / NumDriver;
Run;
Proc Sort Data = PopulationPerDriver;
By PPD;
Run;

**What happens if there are unmatched cases from either/both side of the data sets?;
**The data will still get merged, the unmatched parts will be replaced with ‘.'(numeric) or ‘ ’(char);
/*It’s the same logic as append */
Data DS1;
Input ID Gender $;
Datalines;
9001 Male
9002 Female
9003 Male
9004 Male
9005 Female
;
Run;

Data DS2;
Input ID Income;
Datalines;
9003 50000
9004 60000
9005 75000
9006 65000
9007 45000
;
Run;
Proc Sort Data=ds1;
By id;
Run;
Proc Sort data=ds2;
by id;
run;
Data ds;
Merge ds1 ds2;
By id;
Run;

**How to identify the unmatched observations;
**Use if then logic and In function;
Data ds;
Merge ds1 (In = a)ds2 (In = b);
By id;
If a and b Then i = 1;
Else If a Then i = 2;
Else If b Then i = 3;
Run;

Data Transac;
Input Order $ Province : $30. Total;
Datalines;
BA00001 Ontario 1432
BA00002 Ontario 1455
BA00003 Ontario 2435
BA00004 Quebec 894
BA00005 Quebec 1745
BA00006 Quebec 997
BA00007 Alberta 562
BA00008 Alberta 132
BA00009 Alberta 987
BA00010 Manitoba 562
;
Run;

Data Tax;
Input Province : $30. Tax;
Datalines;
Ontario 13
Quebec 14.975
Alberta 5
Manitoba 5
;
Run;

Proc Sort Data = Transac;
By Province;
Run;

Proc Sort Data = Tax;
By Province;
Run;

Data TaxCal;
Merge Transac Tax;
By Province;
AfterTax = Total * (1+Tax/100);
Run;

Data ds;
Merge ds1 (In = a)ds2 (In = b);
By id;
If a and b Then i = 1;
Else If a Then i = 2;
Else If b Then i = 3;
Run;

/*All can be matched from both sides as i = 1*/
Data Ck;
Merge Transac (In = a) Tax (In = b);
By Province;
If a and b Then i = 1;
Else If a Then i = 2;
Else If b Then i = 3;
Run;

/*(In =) Option*/
Data DS1;
Input ID Gender $;
Datalines;
9001 Male
9002 Female
9003 Male
9004 Male
9005 Female
;
Run;

Data DS2;
Input ID Income;
Datalines;
9003 50000
9004 60000
9005 75000
9006 65000
9007 45000
;
Run;

Proc Sort Data=ds1;
By id;
Run;

Proc Sort Data=ds2;
By id;
Run;

/*Although a and b are mentioned in the data step, they are not visible in the dataset*/
data ds;
Merge ds1 (In=a) ds2 (In=b);
By id;
Run;

/*Since a and b are mentioned in data step, they can be referred in your program*/
/*It can be used to create varaibles*/
Data ds;
Merge ds1(In = a) ds2 (In = b);
By id;
Origds1 = a;
Run;
/*Var Origds1 will be flaged as 1 if true, 0 if false*/

/*Transposing data set with proc transpose*/
**Trasposing data set is a common data manipulation task;
Data Rev;
Input Month $ Revenue;
Datalines;
Jan 253000
Feb 460000
Mar 365000
;
Run;
/*Id is the row to be and Var is the ob to be*/
Proc Transpose Data = Rev Out = t_Rev;
Id Month;
Var Revenue;
Run;
**Three parts of the Proc Transpose:
1.Out Option: Creates the transposed dataset. If out if not defined, the transposed data set will not be output/saved.
2.ID Option: Specifies the variable to be transposed as the column headers (rows)
3.Var Statement: Specifies the results to be transposed;

Data TEMP;
Input Hours $ Temperature;
Datalines;
H0 15
H1 13
H2 13
H3 14
H4 14
H5 14
H6 13
H7 14
H8 14
H9 14
H10 14
H11 15
H12 16
H13 17
H14 16
H15 16
H16 16
H17 15
H18 15
H19 14
H20 14
H21 14
H22 14
H23 15
;
Run;

Proc Transpose Data = Temp Out = t_Temp;
ID Hours;
Var Temperature;
Run;
/*Use Sum(of V1, V2, V3) to sum the variables*/
Data CleanTemp (Drop = _NAME_);
Set t_Temp;
Bf12pm = Sum(of H0, H1, H2, H3, H4, H6, H6, H7, H8, H9, H10, H11);
Af12pm = Sum(of H12, H13, H14, H15, H16, H17, H18, H19, H20, H21, H22, H23);
Run;

**When reverse transposing a data set, the ID statement is not needed.;
/*13-18*/
Data Q2;
Input Store $ month $ profit;
Datalines;
Store1 April 169000
Store2 April 105000
Store1 May 104000
Store2 May 130000
Store1 June 178000
Store2 June 145000
;
Run;
**The problem when transposing this data set is that there are more than 1 observation in each of the ID groups.;
/*This is an example that leads to an error*/
Proc Transpose Data = Q2 Out = t_Q2;
ID Month;
Var Profit;
Run;
/*This cannot be executed because the ID value April occurs twice and it’s not acceptable as ID parameter*/
/*Now it’s time to introduce By Statement*/
Proc Sort Data = Q2;
By Store;
Run:
/*This command must be executed first as proc sort should be always used before By is used in other steps
If the Proc Transpose with By statement are run alone, they cannot be properly executed*/
/*With only 1 observation per store, SAS is able to transpose the data set without error*/
Proc Transpose Data = Q2 Out = t_Q2;
ID Month;
Var Profit;
By Store;
Run;
/*Now it becomes a crosstab*/

Data Flight;
Infile datalines dsd;
Input From : $30. To : $30. Airline : $30. Price;
Datalines;
Toronto,New York,Air Canada,600
Toronto,New York,American Airline,400
Toronto,New York,Virgin Airline,700
Toronto,Calgary,Air Canada,550
Toronto,Calgary,American Airline,750
Toronto,Calgary,Air Transat,600
Toronto,Hong Kong,Air Canada,1950
Toronto,Hong Kong,Cathay Pacific,1850
Toronto,Hong Kong,Hainan Airline,1600
;
Run;

Proc Sort Data = Flight;
By From To;
Run;

Proc Transpose Data = Flight Out = t_Flight;
ID Airline;
Var Price;
By From To;
Run;

/*Getting the Min fare should be a separate step*/
Data Cheapest;
Set t_Flight;
MinFare = Min(Air_Canada, American_Airline, Virgin_Airline,Air_Transat,Cathay_Pacific,Hainan_Airline);
Run;

/*Reverse Transpose*/
**Long to wide is transpose and wide to long is reverse transpose;
**The command is still proc transpose;
Data Food;
Input Restaurant : $30. Coffee Muffin Cookies;
Datalines;
McDonald’s 1.75 1.5 1.3
;
Run;

Proc Transpose Data = Food Out = rt_Food;
Var Coffee Muffin Cookies;
Run;
/*This statement will transpose the data set and the restaurant variable is also gone.*/
**The Var variable lists the variables to be reverse transposed.;
**The ID statement is only needed when doing stardard transpose and it’s not needed for reverse transpose;

Data t_q2;
Input Store $ _name_ $ April May June;
Datalines;
Store1 profit 169000 104000 178000
Store2 profit 105000 130000 145000
;
Run;
/*Perform a reverse transpose on T_Q2 and combine the profit from April, May, and June back into one column.*/
Proc Transpose Data = t_q2 Out = rt_q2;
Var April May June;
By Store;
Run;
/*Think throught what’s the Var and By in the proc transpose*/

/*Variable _N_ and _ERROR_*/
Data Temp;
Input Num1 Num2;
Datalines;
100 5
101 4
102 3
103 2
104 1
105 0
106 -1
107 -2
108 -3
109 -4
110 -5
;
Run;
**Variable _N_ shows the observation number in the data set.;
**Similiar to the ‘a’ in (in = a) option, variable _n_ does not exsist in the data set;
**To really see the variable _N_, you need to copy _N_ into a variable that is visible in the data set;
Data Temp2;
Set Temp;
obsnum = _N_;
Run;
**Variable _ERROR_ is another automatic variable built in the data step;
**It identifies the observation with error;
Data Temp3;
Set Temp;
Num3 = Num1 / Num2;
obserr = _ERROR_;
Run;
**_ERROR_ flags the observation as ‘1’ when there is an error, and ‘0’ when there is not.;
**You can use _ERROR_ to identify the problematic observation;
Proc Sort Data = SASHelp.Cars Out = Cars;
By Descending Horsepower;
Run;

Data Top3HP;
Set Cars;
If _N_ in (1, 2, 3);
Run;
/*Use ‘if’ and not ‘where’ in the above case as where only indexes the variable in the data set*/
/*In proc statements, only where but not if can be used*/

/*Retain Statement*/
Data Revenue;
Infile datalines dsd;
Input Week : $10. Revenue;
Datalines;
Week 1, 4000
Week 2, 5000
Week 3, 8000
Week 4, 6500
Week 5, 8900
Week 6, 4500
Week 7, 9800
Week 8, 10000
Week 9, 6800
Week 10, 6300
;
Run;
Data Revenue2;
Set Revenue;
If _N_ = 1 Then Total = 0;
Run;
/*In this case, the first ob will have the Total var = 0 but the rest have missing value for that var.*/

/*Now we try using the retain varaible*/
Data Revenue2;
Set Revenue;
Retain Total;
If _N_ = 1 Then Total = 0;
Run;
/*After it’s executed, all other values for Total are filled with 0*/
**The retain statement retrains the value of ‘0’ from the first ob to the next’
**The first ob from total is assigned the value of ‘0’. This value retains to the next ob and as a result,
the entire column of Total consists of ‘0’;

**The main application of the retain statement is to perform cumulative summation.;
Data Revenue3;
Set Revenue;
Retain Total;
/*If _N_ = 1 Then Total = 0;*/
Total = Sum(Revenue, Total);
Run;
/*It means that the summation result will be carried over from row to row.*/
/*If the retain condition is not set in the program above, Total will be equal to Revenue as Total = 0 in the first place*/
**Note: setting the initial value to the variable TOTAL in this program is actually not required.;

Data Inventory;
Infile datalines dsd;
Input Product : $30. Store : $10. Qty;
Datalines;
Billy’s Bookshelf,Store 1, 150
Billy’s Bookshelf,Store 2, 200
Billy’s Bookshelf,Store 3, 230
Billy’s Bookshelf,Store 4, 160
Billy’s Bookshelf,Store 5, 180
Billy’s Bookshelf,Store 6, 220
Billy’s Bookshelf,Store 7, 130
Billy’s Bookshelf,Store 8, 450
Billy’s Bookshelf,Store 9, 260
Billy’s Bookshelf,Store 10, 300
;
Run;

Data Inventory2;
Set Inventory;
Retain Total;
Total = Sum(Qty, Total);
Run;
/*Firstly identify which one is the var to retain and what’s the total variable*/

/*Assigning initial values with Retain Statement*/
**There are 3 ways to assign initial values with a Retain statement:
**Retain Var1 0;
Data Revenue2;
Set Revenue;
Retain Total 0;
Run;
**Retain Var1 . (missing value);
Data Revenue4;
Set Revenue;
Retain Total .;
Run;

**-Retain Var1;
/*Retain Statement is to assign the starting value. It’s usually the three mentioned above, it can also be assigned to numbers such as 100*/
Data Revenue2;
Set Revenue;
Retain Total 100;
Run;
/*If you assign Retain Var1 without any varaible definition fomula, the variable will not be created*/
/*See the example below*/
Data Revenue5;
Set Revenue;
Retain Total;
Run;
**When computing cumulative summation, assigning initial value is not needed and the summation can be carried out;
Data Revenue3;
Set Revenue;
Retain Total;
Total = Sum(Revenue, Total);
Run;
/*If you in the case above specifying the Total as 100, 100 will be added since the first Total calculation*/
/*You can see that from the example below: assuming there is an initial inventory of 1500*/
Data Inventory;
Infile datalines dsd;
Input Product : $30. Store : $10. Qty;
Datalines;
Billy’s Bookshelf,Store 1, 150
Billy’s Bookshelf,Store 2, 200
Billy’s Bookshelf,Store 3, 230
Billy’s Bookshelf,Store 4, 160
Billy’s Bookshelf,Store 5, 180
Billy’s Bookshelf,Store 6, 220
Billy’s Bookshelf,Store 7, 130
Billy’s Bookshelf,Store 8, 450
Billy’s Bookshelf,Store 9, 260
Billy’s Bookshelf,Store 10, 300
;
Run;
Data Inventory2;
Set Inventory;
Retain Total 1500;
Total = Sum(Qty, Total);
Run;

/*(+) Expression*/
**Combining the use of the Retain and Sum function allow you to compute the summation across the obs. The ‘+’ is a shortcut to also do that;
Data Inventory;
Infile datalines dsd;
Input Product : $30. Store : $10. Qty;
Datalines;
Billy’s Bookshelf,Store 1, 150
Billy’s Bookshelf,Store 2, 200
Billy’s Bookshelf,Store 3, .
Billy’s Bookshelf,Store 4, 160
Billy’s Bookshelf,Store 5, .
Billy’s Bookshelf,Store 6, 220
Billy’s Bookshelf,Store 7, 130
Billy’s Bookshelf,Store 8, 450
Billy’s Bookshelf,Store 9, 260
Billy’s Bookshelf,Store 10, 300
;
Run;
Data Inventory2;
Set Inventory;
Total + Qty;
Run;
**The code does the following:
1. Creates the variable Total with no initial value (missing)
2. The varaible total has a built-in retain capability. Its value is retained from one ob to the next.
3. The (+Revenue) syntax adds the revenue to the varaible Total on each ob.

**The (+) can also work in the case of missing values;

/*Exercise 1*/
/*Sort the data set by highest selling price (i.e. HIGH) in descending order*/
Data Stocks;
Set SASHelp.Stocks;
Run;
Proc Sort Data = Stocks Out = SortedStocks;
By Stock Descending High;
Run;

/*Exercise 2*/
/*Create a data set that contains the highest and lowest trade volume for each stock.*/
Proc Sort Data = Stocks Out = RStocks;
By Stock Volume;
Run;
/*Must think through what’s the sort variable and what’s the First. / Last. variable*/
/*There is a by statement in first./last. operations and proc sort needs to be done before that.*/
Data MLStocks;
Set RStocks;
By Stock Volume;
If First.Stock or Last.Stock;
Run;

/*Exercise 3*/
/*Find out the difference between the highest and lowest selling price for each stock between January 1, 1990, to December 31, 1999.*/
Data Ck;
Length date $10;
Input date $;
Datalines;
1990-01-01
1999-12-31
;
Run;
/*to convert to numeric, the parameter should be it’s original format*/
Data Cvt;
Set Ck;
date2 = input(date, yymmdd10.);
Run;
/*10958 correspond to January 1, 1990 and 14609 correspond to December 31, 1999*/
Data DStocks (Keep = Date Stock High Low Diff);
Set Stocks;
Diff = High – Low;
Date2 = Date;
Where Date >= 10958 and Date <= 14609;
Run;

/*Model Answer:*/
Proc Sort Data=sashelp.stocks out=high;
by stock high;
where date>=’01JAN1990’D and date<=’01DEC1999’D;
run;
/*To refer to the date, we can simply use date>=’01JAN1990’D*/

/*Model Answer:*/
/*I understood wrongly. It’s the highest – lowest in each stock (i.e. brand) not each entry*/
Proc Sort Data=sashelp.stocks out=high;
by stock high;
where date>=’01JAN1990’D and date<=’01DEC1999’D;
run;
data high2;
set high;
by stock high;
if last.stock;
keep stock high;
run;
Proc Sort Data=sashelp.stocks out=low;
by stock low;
where date>=’01JAN1990’D and date<=’01DEC1999’D;
run;
data low2;
set low;
by stock low;
if first.stock;
keep stock low;
run;
data diff;
merge high2 low2;
by stock;
diff = high-low;
run;
Proc Sort Data=sashelp.stocks out=high;
by stock high;
where date>=’01JAN1990’D and date<=’01DEC1999’D;
run;
data high2;
set high;
by stock high;
if last.stock;
keep stock high;
run;
Proc Sort Data=sashelp.stocks out=low;
by stock low;
where date>=’01JAN1990’D and date<=’01DEC1999’D;
run;
data low2;
set low;
by stock low;
if first.stock;
keep stock low;
run;
data diff;
merge high2 low2;
by stock;
diff = high-low;
run;

/*Exercise 4*/
/*Find out the product that shows the highest return rate across all of the regions.*/
Data Shoes;
Set SASHelp.Shoes;
Rate = Returns / Sales;
Run;
Proc Sort Data = Shoes Out = StShoes;
By Region Rate;
Run;
/*There must be a by statement in first. / last. and the by statement is the same as in proc sort in previous step*/
Data HRateShoes;
Set StShoes;
By Region Rate;
If Last.Region;
Run;
Proc Sort Data = HRateShoes Out = HHRate;
By Descending Rate;
Run;

/*Model Answer:*/
Data Shoes;
Set sashelp.shoes;
rate = returns / sales;
arbi = 1; * Arbitrary variable *;
Run;
Proc sort data=shoes;
by rate;
Run;
Data Shoes2;
Set shoes;
by arbi rate;
if last.arbi;
Run;
**Please note that an arbitrary variable is needed when there is no grouping variable.;
/*In this case, we can just use arbitrary varaible to get the highest return rate among
all regions as highest among all regions -> highest in this data set. We only need to group by regions when we need to find
the highest in each region*/

/*Exercise 5*/
/*How to get subtotal by categories*/
/*The total bonus to be given out will be 5% of the total sales across all region. Calculate the total bonus to be given out to all regional managers.*/
/*Use retain and first last*/
Proc Sort Data = Shoes;
By Region;
Run;

Data TotalR;
Set Shoes;
By Region ;
If First.Region Then TotalR = 0;
TotalR +Sales;
Arbi = 1;
If Last.Region;
Keep Region Arbi TotalR;
Run;

/*Exercise 6*/
/*Get contribution*/
Data TotalC;
Set TotalR;
By Arbi;
Retain Total;
Total = Sum(TotalR, Total);
If Last.Arbi;
Keep Region Arbi Total;
Run;
/*You need to specify that the merge is by Arbi, or else it will be merged by region*/
/*If it’s merged by region, only western europe will get merged and the rest will be missing values, while if use arbi, all can be merged*/
/*The good use of Arbitrary variable*/
Data Contrib;
Merge TotalR TotalC;
By Arbi;
Contrib = TotalR/Total;
Run;

/*Model Answer:*/
/*The good use of Merge*/
Data Shoes;
Set sashelp.shoes;
arbi = 1;
Run;

Data Total;
Set Shoes;
By Arbi;
Total+Sales;
If last.arbi;
keep Arbi total;
Run;

** Calculate sales from each region **;
Proc Sort Data=Shoes;
By region;
Run;

Data TotalR;
Set Shoes;
By region;
if first.region then totalR=0;
Arbi = 1;
TotalR+Sales;
If last.region;
keep Region Arbi totalR;
Run;

** Calculate percentage contribution and bonus **;
Data Contri;
Merge TotalR Total;
By Arbi;
ContriPct = TotalR / Total;
Bonus = ContriPct * Total * 0.05;
Run;

[TTStats] What is a t-test?

by pingpongsg on August 28, 2018

In 2014/2015, the material used to make table tennis balls changed from celluloid to a non-flammable plastic. These balls are a tiny bit bigger than the 40mm celluloid balls, but the main difference is the type of material they are made from. According to the information online, the balls have less spin and bounce after the change of materials.

As a data person, I want to measure whether there are big differences in the bounciness of the two types of balls. I get a box of plastic balls and another of celluloid balls, both of which have 10 pieces of new balls inside. The way I measure the bounciness is that I let each ball free fall at the height of 300mm onto the table tennis table and measure what’s the height of its first bounce.

After 20 free falls, I have obtained the following data:

	1	2	3	4	5	6	7	8	9	10
Plastic	219.831	218.499	219.166	219.5	220.332	218.666	220.333	219.164	220.021	220.332
Celluloid	227.489	229.833	231.164	228.33	231.152	232.665	227.658	230.332	230.49	229.89

With these data, we can calculate the average of the height of each type of balls’ first bounces:

Plastic	219.584mm
Celluloid	229.9mm

From the mean figures, we can see that the average bounce height of plastic balls is almost 10mm smaller than celluloid balls. However, what does it really tell us? Does it mean that there is a big difference between the heights of the first bounce of the two types of balls? Is it possible that the 10mm differences are just by chance (e.g. in a new test, plastic balls may have higher first bounce heights than the celluloid balls)? How can we draw the conclusion that there are a big enough differences in the heights that we can prove that the difference in the bouncy level does exist?

The distribution of the heights is plotted on the picture below. There are obvious differences and the minimum of the celluloid height is already greater than the maximum height of plastic, however, is it conclusive enough?

In order to figure out this question, we need to introduce the t-test.

The t-test is a commonly used technique that measures the differences between sample means. It calculates the differences between the groups and compares it to the differences within the groups

T= variance between groups/variance within groups

A big t-value indicates different groups and a small t-value indicates similar groups.

How do we know a t-value is big enough to decide that the groups are very different? Each t-value corresponds to a p-value which indicates the possibility of ‘there is no real difference between the groups’. A p-value of less than 0.05 says that the probability of two groups being indifferent is less than 5%. The smaller the p-value, the more likely that the two means are different. Usually, we take 5% as a threshold of judging whether the difference is significant.

Now back to our example, we run a t-test on the data we collected and below are the outputs:

The p-value is 3.955e-10, which means that there is less than 0.00001% probability that the average bounce heights of plastic balls and celluloid balls are indifferent. This proves that the bounce heights of plastic balls are significantly lower than the celluloid balls and they are much less bouncy.

The t-test that we use to compare the means between two groups is called an independent t-test. There are also paired t-test and one-sample t-test.

The paired t-test is to test the means of the same group twice. The advantage of this t-test is that it reduces possibly variability of subjects, however, it’s prone to order effects. e.g. the body temperature of a group of runners before and after a marathon.

The one-sample t-test is to compare the group mean with a hypothesized/known sample mean. e.g. the average IQ of a group of Cambridge graduate vs. 100

*The plastic vs. celluloid ball example I gave is just for storytelling purpose. I did not conduct that experiment. The tests and the figures are from the experiment done by Preston Table Tennis Association. If you are interested in the experiment, kindly check out their video: Plastic & Celluloid Table Tennis Balls Compared: Bounce and Conformity

SAS Notes

[TTStats] What is a t-test?

Cart