🔄 PROC SORT in SAS: Sorting Data and Removing Duplicates Using NODUP and NODUPKEY Options

📌 What is PROC SORT in SAS?

PROC SORT is a procedure used to sort a SAS dataset by one or more variables. Sorting is essential for data preparation, reporting, and ensuring that other procedures work as expected.

✅ Basic Syntax:

proc sort data=input_dataset out=output_dataset;
  by variable1 variable2;
run;

The DATA= and OUT= options specify the input and output data sets respectively

Example:

Data cars;
set sashelp.cars;
run;

proc sort data=sashelp.cars out=cars;
by Make;
run;

You can specify as many variables in By statement as you wish.

With one BY variable, SAS sorts the data based on the values of that variable. If more than one variable, SAS sorts observations by the first variable, then by the

second variable within categories of the first, and so on.

Example:

proc sort data=cars;
by Make Model;
run;

by Make Model;
run;

proc sort data=cars;
by Make Model;
run;

proc sort data=cars;
by Make Model;
run;

By default Proc sort sorts data in ascending order. To sort data in descending order, add keyword DESCENDING to BY statement before variable that you want to sort in descending

Example:

proc sort data=cars;
by Make descending Model;
run;

🧹 Removing Duplicates with PROC SORT -

When working with datasets in SAS, duplicate records can cause errors in analysis and reporting. Fortunately, PROC SORT provides two simple but powerful options—NODUP and NODUPKEY—to efficiently remove duplicates.

In this post, you'll learn how to use PROC SORT to:

Sort data in ascending or descending order
Eliminate duplicate rows using NODUP
Eliminate rows with duplicate keys using NODUPKEY

SAS provides two key options:

1. 🔁 `NODUP` (or `NODUPREC`)

Removes completely duplicate rows — i.e., observations where all variable values are identical.

proc sort data=raw_data out=nodup_data nodup;
  by _all_;
run;

🔍 Explanation:

_all_ tells SAS to consider all variables.
NODUP eliminates rows that are 100% duplicates.

2. 🔁 `NODUPKEY`

Removes duplicates based on the BY variable(s). If multiple rows share the same value(s) in the BY variable(s), only the first occurrence is kept.

proc sort data=raw_data out=unique_keys nodupkey;
  by customer_id;
run;

🔍 Explanation:

Keeps only one record per customer_id.
Other columns (e.g., name, email) are not considered.

📊 Example Dataset

customer_id name purchase
101 John 100
101 John 100
102 Alice 200
103 Bob 150
101 John 120

✅ Using `NODUP`:

proc sort data=shop out=cleaned_data nodup;
  by _all_;
run;

🧾 Output:

customer_id	name	purchase
101	John	100
101	John	120
102	Alice	200
103	Bob	150

☑️ Removes 1 identical row.

✅ Using `NODUPKEY`:

proc sort data=shop out=unique_ids nodupkey;
  by customer_id;
run;

🧾 Output:

customer_id	name	purchase
101	John	100
102	Alice	200
103	Bob	150

☑️ Keeps only the first entry per customer_id.

⚠️ Key Differences Between NODUP and NODUPKEY

Feature `NODUP` `NODUPKEY`
Compares All variables in the dataset Only BY variable(s)
Removes Fully identical rows Duplicate keys, keeps first row
Use case Exact duplicates Group-wise deduplication

Feature	`NODUP`	`NODUPKEY`
Compares	All variables in the dataset	Only BY variable(s)
Removes	Fully identical rows	Duplicate keys, keeps first row
Use case	Exact duplicates	Group-wise deduplication

🔧 Best Practices

Always sort by the variables you want to deduplicate.
Use `NODUPKEY` when you're only interested in the first occurrence of a key.
Use `NODUP` to eliminate true row-level duplicates.
Consider `FIRST.` or `LAST.` variables in a `DATA` step for more control.

✅ Real-world Use Case

Scenario: You have e-commerce transactions with repeated customer entries. You want only the first transaction per customer.

proc sort data=transactions out=first_purchase nodupkey;
  by customer_id;
run;

💡 This ensures you retain just one record per customer for reporting or analysis.

🔍 Conclusion

Removing duplicates is a fundamental step in data cleaning. With just a few lines of code, `PROC SORT` allows you to:

Organize your data
Clean unnecessary duplicates
Prepare your dataset for reliable analysis

Learning how to use `NODUP` and `NODUPKEY` will make your SAS data processing cleaner, faster, and more accurate.

🔄 PROC SORT in SAS: Sorting Data and Removing Duplicates Using NODUP and NODUPKEY Options

📌 What is PROC SORT in SAS?

✅ Basic Syntax:

🧹 Removing Duplicates with PROC SORT -

SAS provides two key options:

1. 🔁 `NODUP` (or `NODUPREC`)

Removes completely duplicate rows — i.e., observations where all variable values are identical.

proc sort data=raw_data out=nodup_data nodup;
`by _all_; run;`

🔍 Explanation:

`_all_` tells SAS to consider all variables.
`NODUP` eliminates rows that are 100% duplicates.

2. 🔁 `NODUPKEY`

📊 Example Dataset

customer_id name purchase
101 John 100
101 John 100
102 Alice 200
103 Bob 150
101 John 120

✅ Using `NODUP`:

proc sort data=shop out=cleaned_data nodup;
`by _all_; run;`

🧾 Output:

customer_id name purchase
101 John 100
101 John 120
102 Alice 200
103 Bob 150

☑️ Removes 1 identical row.

✅ Using `NODUPKEY`:

proc sort data=shop out=unique_ids nodupkey;
`by customer_id; run;`

🧾 Output:

customer_id name purchase
101 John 100
102 Alice 200
103 Bob 150

☑️ Keeps only the first entry per customer_id.

⚠️ Key Differences Between NODUP and NODUPKEY

Feature `NODUP` `NODUPKEY`
Compares All variables in the dataset Only BY variable(s)
Removes Fully identical rows Duplicate keys, keeps first row
Use case Exact duplicates Group-wise deduplication

🔧 Best Practices

Always sort by the variables you want to deduplicate.
Use `NODUPKEY` when you're only interested in the first occurrence of a key.
Use `NODUP` to eliminate true row-level duplicates.
Consider `FIRST.` or `LAST.` variables in a `DATA` step for more control.

✅ Real-world Use Case

Scenario: You have e-commerce transactions with repeated customer entries. You want only the first transaction per customer.

proc sort data=transactions out=first_purchase nodupkey;
`by customer_id; run;`

💡 This ensures you retain just one record per customer for reporting or analysis.

🔍 Conclusion

Post a Comment

0 Comments

SAS >>

Search

Popular Posts

About SAS – Everything You Need to Know About SAS ProgrammingMarch 28, 2020

📘 Commonly Used Terms in SAS: A Beginner-Friendly GlossaryMarch 30, 2020

Creating Groups Using PROC FORMAT | Real Time Problems January 14, 2021

Visualization >>

Footer Menu Widget

🔄 PROC SORT in SAS: Sorting Data and Removing Duplicates Using NODUP and NODUPKEY Options

📌 What is PROC SORT in SAS?

✅ Basic Syntax:

🧹 Removing Duplicates with PROC SORT -

SAS provides two key options:

1. 🔁 NODUP (or NODUPREC)

Removes completely duplicate rows — i.e., observations where all variable values are identical. proc sort data=raw_data out=nodup_data nodup; by _all_; run; 🔍 Explanation: _all_ tells SAS to consider all variables.NODUP eliminates rows that are 100% duplicates.

2. 🔁 NODUPKEY

📊 Example Dataset

customer_idnamepurchase101John100101John100102Alice200103Bob150101John120

✅ Using NODUP:

proc sort data=shop out=cleaned_data nodup; by _all_; run; 🧾 Output: customer_idnamepurchase101John100101John120102Alice200103Bob150 ☑️ Removes 1 identical row.

✅ Using NODUPKEY:

proc sort data=shop out=unique_ids nodupkey; by customer_id; run; 🧾 Output: customer_idnamepurchase101John100102Alice200103Bob150 ☑️ Keeps only the first entry per customer_id.

⚠️ Key Differences Between NODUP and NODUPKEY

FeatureNODUPNODUPKEYComparesAll variables in the datasetOnly BY variable(s)RemovesFully identical rowsDuplicate keys, keeps first rowUse caseExact duplicatesGroup-wise deduplication

🔧 Best Practices

Always sort by the variables you want to deduplicate.Use NODUPKEY when you're only interested in the first occurrence of a key.Use NODUP to eliminate true row-level duplicates.Consider FIRST. or LAST. variables in a DATA step for more control.

✅ Real-world Use Case

Scenario: You have e-commerce transactions with repeated customer entries. You want only the first transaction per customer. proc sort data=transactions out=first_purchase nodupkey; by customer_id; run; 💡 This ensures you retain just one record per customer for reporting or analysis.

🔍 Conclusion

You may like these posts

Post a Comment

0 Comments

SAS >>

Search

Contacts [Social Media]

Popular Posts

About SAS – Everything You Need to Know About SAS ProgrammingMarch 28, 2020

📘 Commonly Used Terms in SAS: A Beginner-Friendly GlossaryMarch 30, 2020

Creating Groups Using PROC FORMAT | Real Time Problems January 14, 2021

Visualization >>

Footer Menu Widget

Social Footer Widget

1. 🔁 `NODUP` (or `NODUPREC`)

Removes completely duplicate rows — i.e., observations where all variable values are identical.

proc sort data=raw_data out=nodup_data nodup;
`by _all_; run;`

🔍 Explanation:

`_all_` tells SAS to consider all variables.
`NODUP` eliminates rows that are 100% duplicates.

2. 🔁 `NODUPKEY`

customer_id name purchase
101 John 100
101 John 100
102 Alice 200
103 Bob 150
101 John 120

✅ Using `NODUP`:

proc sort data=shop out=cleaned_data nodup;
`by _all_; run;`

🧾 Output:

customer_id name purchase
101 John 100
101 John 120
102 Alice 200
103 Bob 150

☑️ Removes 1 identical row.

✅ Using `NODUPKEY`:

proc sort data=shop out=unique_ids nodupkey;
`by customer_id; run;`

🧾 Output:

customer_id name purchase
101 John 100
102 Alice 200
103 Bob 150

☑️ Keeps only the first entry per customer_id.

Feature `NODUP` `NODUPKEY`
Compares All variables in the dataset Only BY variable(s)
Removes Fully identical rows Duplicate keys, keeps first row
Use case Exact duplicates Group-wise deduplication

Always sort by the variables you want to deduplicate.
Use `NODUPKEY` when you're only interested in the first occurrence of a key.
Use `NODUP` to eliminate true row-level duplicates.
Consider `FIRST.` or `LAST.` variables in a `DATA` step for more control.

Scenario: You have e-commerce transactions with repeated customer entries. You want only the first transaction per customer.

proc sort data=transactions out=first_purchase nodupkey;
`by customer_id; run;`

💡 This ensures you retain just one record per customer for reporting or analysis.