🔄 PROC SORT in SAS: Sorting Data and Removing Duplicates Using NODUP and NODUPKEY Options

📌 What is PROC SORT in SAS?

PROC SORT is a procedure used to sort a SAS dataset by one or more variables. Sorting is essential for data preparation, reporting, and ensuring that other procedures work as expected.

✅ Basic Syntax:

proc sort data=input_dataset out=output_dataset;
by variable1 variable2; run;

The DATA= and OUT= options specify the input and output data sets respectively

Example:

Data cars;
set sashelp.cars;
run;

proc sort data=sashelp.cars out=cars;
by Make;
run;

Proc SORT in SAS


You can specify as many variables in By statement as you wish. 

With one BY variable, SAS sorts the data based on the values of that variable. If more than one variable, SAS sorts observations by the first variable, then by the

second variable within categories of the first, and so on. 

Example:

proc sort data=cars;
by Make Model;
run;

by Make Model;
run;
proc sort data=cars;
by Make Model;
run;
proc sort data=cars;
by Make Model;
run;

Proc SORT in SAS by Datahark


By default Proc sort sorts data in ascending order. To sort data in descending order, add keyword DESCENDING to BY statement before variable that you want to sort in descending 

Example: 

proc sort data=cars;

by Make descending Model;

run;

Proc SORT in SAS


🧹 Removing Duplicates with PROC SORT -

When working with datasets in SAS, duplicate records can cause errors in analysis and reporting. Fortunately, PROC SORT provides two simple but powerful options—NODUP and NODUPKEY—to efficiently remove duplicates.

In this post, you'll learn how to use PROC SORT to:

  • Sort data in ascending or descending order
  • Eliminate duplicate rows using NODUP
  • Eliminate rows with duplicate keys using NODUPKEY

SAS provides two key options:

1. 🔁 NODUP (or NODUPREC)

Removes completely duplicate rows — i.e., observations where all variable values are identical.

proc sort data=raw_data out=nodup_data nodup;
by _all_; run;

🔍 Explanation:

  • _all_ tells SAS to consider all variables.
  • NODUP eliminates rows that are 100% duplicates.


2. 🔁 NODUPKEY

Removes duplicates based on the BY variable(s). If multiple rows share the same value(s) in the BY variable(s), only the first occurrence is kept.

proc sort data=raw_data out=unique_keys nodupkey;
by customer_id; run;

🔍 Explanation:

  • Keeps only one record per customer_id.
  • Other columns (e.g., name, email) are not considered.


📊 Example Dataset

customer_idnamepurchase
101John100
101John100
102Alice200
103Bob150
101John120

✅ Using NODUP:

proc sort data=shop out=cleaned_data nodup;
by _all_; run;

🧾 Output:

customer_idnamepurchase
101John100
101John120
102Alice200
103Bob150

☑️ Removes 1 identical row.


✅ Using NODUPKEY:

proc sort data=shop out=unique_ids nodupkey;
by customer_id; run;

🧾 Output:

customer_idnamepurchase
101John100
102Alice200
103Bob150

☑️ Keeps only the first entry per customer_id.


⚠️ Key Differences Between NODUP and NODUPKEY

FeatureNODUPNODUPKEY
ComparesAll variables in the datasetOnly BY variable(s)
RemovesFully identical rowsDuplicate keys, keeps first row
Use caseExact duplicatesGroup-wise deduplication

🔧 Best Practices

  • Always sort by the variables you want to deduplicate.
  • Use NODUPKEY when you're only interested in the first occurrence of a key.
  • Use NODUP to eliminate true row-level duplicates.
  • Consider FIRST. or LAST. variables in a DATA step for more control.


✅ Real-world Use Case

Scenario: You have e-commerce transactions with repeated customer entries. You want only the first transaction per customer.

proc sort data=transactions out=first_purchase nodupkey;
by customer_id; run;

💡 This ensures you retain just one record per customer for reporting or analysis.


🔍 Conclusion

Removing duplicates is a fundamental step in data cleaning. With just a few lines of code, PROC SORT allows you to:

  • Organize your data
  • Clean unnecessary duplicates
  • Prepare your dataset for reliable analysis

Learning how to use NODUP and NODUPKEY will make your SAS data processing cleaner, faster, and more accurate.

Post a Comment

0 Comments