- In this article we will solve Duplicate Job Listing SQL question which is asked by LinkedIn as per DataLemur website.
If you don’t know DatLemur then please do visit , it’s one of the best website to practice SQL questions and improve your SQL skills.
- We have been provided with input table job_listings that contains job post by each company.
- But some companies posted duplicate listing.
- Our job is to find out how many companies made that mistake.
- In input table we can see , job id 945 and job id 164 has same description and title and posted by same company , hence our output count is 1.
- There are many ways to find the duplicate in the table , but for this problem i will use window function called row_number().
- Window function create windows over defined partition logic , so lets say in our use case we partition it in multiple windows based on company id, so all the listing within same company will become single window, and now we can perform various operation over that window.
- By default entire table is single window and we can partition table into multiple windows based on some logic.
- Below query partition the input table based on company_id, title and description. We chose these fields because they make each entry a candidate for duplicate.
- If company x has two duplicate entries then they would have same title, description and company_id.
select *, row_number() over(partition by company_id,title,description order by job_id) rn from job_listings
- Our one of the partition is underlined with red in above image.
- Now we have our partition ready, we will perform row_number operation that will give unique number to each record in one partition. for new window it will again start from zero.
- Since if we have two same job listing then we will have records with row number greater than 1. For distinct job entry row number will always be 1 because partition size will always be 1.
select count(distinct company_id) from ( select *, row_number() over(partition by company_id,title,description order by job_id) rn from job_listings ) x where rn >1
- we just count incident where row number is greater 1 , that would be our output.
- Our sql code gets accepted on the platform.
- In this article we solve the duplicate entry issue, although there are many ways to deduplicate input data , but in this example we used windowing and partition approach and then ranked each record to find out potential duplicates .