Comparative Study of Record Linkage Approaches for Big Data

Randa MOHAMED; Ali EL-BASTAWISSY; Eman NASR; Mervat GHEITH

doi:10.48048/wjst.2021.7221

Authors

Randa MOHAMED Computer Science, Faculty of Graduate Studies for Statistical Research, Giza, Egypt
Ali EL-BASTAWISSY Computer Science, MSA University, Egypt
Eman NASR Independent Research, Egypt
Mervat GHEITH Computer Science, Faculty of Graduate Studies for Statistical Research, Giza, Egypt

DOI:

https://doi.org/10.48048/wjst.2021.7221

Keywords:

Big Data, Flink, Record linkage, Hadoop, MapReduce, Spark

Abstract

Record linkage is a challenging task for Big Data. This paper therefore sheds light on record linkage approaches for Big Data by comparing them along three dimensions: record linkage phases, dataset properties, and the parallel processing approach for Big Data. Existing comparative studies have only compared record linkage approaches with one another, and only one comparative study has explored the whole record linkage framework, and that for relational databases. The focus of the present study on the dimensions of parallel processing approaches for Big Data and dataset properties was therefore considered worth exploring. It was found that, first, data exploration was almost a non-existent phase despite its importance for understanding the dataset being examined; second, techniques that handle the data standardization and preparation phase of the first dimension were not extensively covered in the literature, which can directly affect the quality of the results; third, record linkage in unstructured data has not yet been explored in the literature; fourth, MapReduce was used in about 50% of the selected studies to handle the parallel processing of Big Data, but due to its limitations, more recent and efficient approaches have been used, such as Apache Spark and Apache Flink. Apache Spark has only recently been adopted to resolve duplicates because it supports in-memory computation, which makes the whole linkage process more efficient. Although the comparative study includes many recent studies that use Apache Spark, adopting Apache Spark to solve the record linkage problem is not yet well explored in the literature, and more research needs to be conducted. In addition, Apache Flink is still rarely used to solve the record linkage problem for Big Data. Fifth, pruning techniques, used to eliminate unnecessary comparisons, are not adequately applied in the covered studies despite their effect on reducing the search space, which results in a more effective record linkage process.
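As a rough illustration of how eliminating unnecessary comparisons reduces the search space, the sketch below contrasts comparing every pair of records with comparing only pairs that share a key. It uses standard blocking as a simple stand-in and is not necessarily one of the pruning techniques applied in the surveyed studies. The sample records, the `zip` blocking key, and the function names `all_pairs` and `blocked_pairs` are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(records):
    """Naive search space: every record is compared with every other record."""
    return list(combinations(range(len(records)), 2))


def blocked_pairs(records, key):
    """Reduced search space: only records sharing a blocking key are compared."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    pairs = []
    for members in blocks.values():
        # Candidate pairs are generated inside each block only.
        pairs.extend(combinations(members, 2))
    return pairs


# Hypothetical sample data, invented for illustration only.
records = [
    {"name": "Jon Smith", "zip": "12611"},
    {"name": "John Smith", "zip": "12611"},
    {"name": "Mona Adel", "zip": "11511"},
    {"name": "Mona Adil", "zip": "11511"},
    {"name": "Omar Said", "zip": "12611"},
]

naive = all_pairs(records)
reduced = blocked_pairs(records, key=lambda r: r["zip"])
print(len(naive), len(reduced))  # prints "10 4"
```

With five records the naive approach yields 10 candidate pairs, while blocking on the assumed `zip` field leaves 4. The trade-off is that true matches whose key values differ are never compared, so the choice of key affects linkage quality.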
