📌 What is PROC SORT in SAS?
PROC SORT
is a procedure used to sort a SAS dataset by one or more variables. Sorting is essential for data preparation, reporting, and ensuring that other procedures work as expected.
✅ Basic Syntax:
The DATA= and OUT= options specify the input and output data sets respectively
Example:
Data cars;set sashelp.cars;run;proc sort data=sashelp.cars out=cars;by Make;run;
With one BY variable, SAS sorts the data based on the values of that variable. If more than one variable, SAS sorts observations by the first variable, then by the
second variable within categories of the first, and so on.
proc sort data=cars;by Make Model;run;
by Make Model;
run;
proc sort data=cars;
by Make Model;
run;
proc sort data=cars;
by Make Model;
run;
By default Proc sort sorts data in ascending order. To sort data in descending order, add keyword DESCENDING to BY statement before variable that you want to sort in descending
Example:
proc sort data=cars;
by Make descending Model;
run;
🧹 Removing Duplicates with PROC SORT -
NODUP
and NODUPKEY
—to efficiently remove duplicates.In this post, you'll learn how to use PROC SORT
to:
- Sort data in ascending or descending order
- Eliminate duplicate rows using
NODUP
- Eliminate rows with duplicate keys using
NODUPKEY
SAS provides two key options:
1. 🔁 NODUP
(or NODUPREC
)
Removes completely duplicate rows — i.e., observations where all variable values are identical.
🔍 Explanation:
_all_
tells SAS to consider all variables.NODUP
eliminates rows that are 100% duplicates.
2. 🔁 NODUPKEY
Removes duplicates based on the BY variable(s). If multiple rows share the same value(s) in the BY variable(s), only the first occurrence is kept.
🔍 Explanation:
- Keeps only one record per
customer_id
. - Other columns (e.g.,
name
,email
) are not considered.
📊 Example Dataset
customer_id name purchase 101 John 100 101 John 100 102 Alice 200 103 Bob 150 101 John 120
customer_id | name | purchase |
---|---|---|
101 | John | 100 |
101 | John | 100 |
102 | Alice | 200 |
103 | Bob | 150 |
101 | John | 120 |
✅ Using NODUP
:
🧾 Output:
customer_id | name | purchase |
---|---|---|
101 | John | 100 |
101 | John | 120 |
102 | Alice | 200 |
103 | Bob | 150 |
☑️ Removes 1 identical row.
✅ Using NODUPKEY
:
🧾 Output:
customer_id | name | purchase |
---|---|---|
101 | John | 100 |
102 | Alice | 200 |
103 | Bob | 150 |
☑️ Keeps only the first entry per customer_id.
⚠️ Key Differences Between NODUP and NODUPKEY
Feature NODUP
NODUPKEY
Compares All variables in the dataset Only BY variable(s) Removes Fully identical rows Duplicate keys, keeps first row Use case Exact duplicates Group-wise deduplication
Feature | NODUP | NODUPKEY |
---|---|---|
Compares | All variables in the dataset | Only BY variable(s) |
Removes | Fully identical rows | Duplicate keys, keeps first row |
Use case | Exact duplicates | Group-wise deduplication |
🔧 Best Practices
- Always sort by the variables you want to deduplicate.
- Use
NODUPKEY
when you're only interested in the first occurrence of a key. - Use
NODUP
to eliminate true row-level duplicates. - Consider
FIRST.
orLAST.
variables in aDATA
step for more control.
✅ Real-world Use Case
Scenario: You have e-commerce transactions with repeated customer entries. You want only the first transaction per customer.
💡 This ensures you retain just one record per customer for reporting or analysis.
🔍 Conclusion
Removing duplicates is a fundamental step in data cleaning. With just a few lines of code, PROC SORT
allows you to:
- Organize your data
- Clean unnecessary duplicates
- Prepare your dataset for reliable analysis
Learning how to use NODUP
and NODUPKEY
will make your SAS data processing cleaner, faster, and more accurate.
0 Comments
If you have any doubt please comment or write us to - datahark12@gmail.com