Current Challenges and Future Directions of Data Mining


Data mining is a powerful technology for extracting hidden predictive information from large databases. As a process, it entails the discovery of interesting knowledge such as associations, patterns, anomalies, changes, and other vital structures from huge amounts of stored data (Jaseena & Julie, 2014). Formally, it is described as the non-trivial extraction of implicit, previously unknown, and potentially useful information from data (Jaseena & Julie, 2014). Large volumes of raw data contain interesting relationships and patterns, and the chief objective of data mining is to unravel them. Big data mining denotes the process of extracting valuable information from huge data sets, a task that was not possible in the past because of their variety, volume, and velocity. In this regard, this paper brings to the fore the current issues, challenges, and future directions associated with data mining.

Mined data is often beneficial because it represents different kinds of patterns, each corresponding to knowledge. As such, analyzing data from multiple approaches and producing valuable summaries helps in the development of business and health-related solutions as well as in forecasting likely future trends (McFarland, 2014). Thanks to such technological advances, businesses and governments can make knowledge-driven decisions. As Verma and Nashine (2012) rightly observe, business questions that were traditionally time-consuming to answer can now be tackled because data mining tools allow entrepreneurs to scour databases for concealed patterns, establishing predictive information that may have eluded experts because it lies beyond their expectations.


Additionally, many companies are already involved in collecting and refining massive quantities of data. Indeed, data mining techniques can be implemented on existing hardware and software systems to exploit data that is already available (Ostherr, 2018). When implemented on parallel processing computers or high-performance servers, data mining tools can reveal answers to pressing business questions, such as which clients are likely to respond to a promotional mailing.

Foundation of Data Mining 

Data mining techniques have been developing for decades and are the product of rigorous research and product development. The evolution can be traced to when data first started being stored on computers. Continuous improvements in data access and the development of technologies that enable real-time navigation of data have propelled this revolution. Data mining has today taken the progression past retrospective data access and navigation to proactive information delivery (Verma & Nashine, 2012). The application of data mining within the business community is ready thanks to advances in three technologies: massive data collection, data mining algorithms, and powerful multiprocessor computers.

Further, the growth in commercial databases is unprecedented, with most companies well beyond 50 gigabytes and some industries holding massively larger volumes. Notably, this change has been accompanied by improved computational engines that have in recent years become relatively affordable (Sharma & Kaur, 2013). As Verma and Nashine (2012) note, data mining algorithms, though in existence for over ten years, have matured into understandable, reliable tools that outperform older statistical methods in every respect.

Specifically, large volumes of data are produced every minute. A recent survey estimates that every minute, over 200 million email messages are sent, Google receives well over four million queries, 72 hours of video are uploaded to YouTube, 277,000 tweets are generated, and over 2 million pieces of content are shared on Facebook (Verma & Nashine, 2012). Such operations would have been impossible some ten years ago. The increase in storage capacities, the availability of large data sets, and increased processing power are the principal reasons behind this revolution in big data. The 3Vs (volume, velocity, and variety) characterize this revolution. Volume denotes the fact that data size is now beyond terabytes and petabytes; this shift to large-scale data makes traditional tools of storage and analysis inadequate. Velocity means that vast amounts of data can be mined at speed, within a predetermined period. Lastly, variety underscores the fact that big data comes from a multiplicity of sources, including structured and unstructured data (Verma & Nashine, 2012). While big data spans 3D and 4D data (geospatial, audio and video, and unstructured text such as social media posts and log files), traditional database systems were meant to address small volumes of consistent, structured data. In this evolution from raw data to information and knowledge, each new step has reinforced the previous one (Kumar, Amit, & Tyagi, 2014).

Data Mining Concept 

Data mining employs comparatively large computing power on huge data sets to determine connections and regularities between data points. Algorithms drawing on machine learning, pattern recognition, and statistics are employed to search large databases automatically. For this reason, data mining is also called Knowledge Discovery in Databases (KDD), much as the term artificial intelligence is an umbrella term applicable to multiple activities (Verma & Nashine, 2012). Within the corporate world, it is used to determine the trajectory of trends and predict future patterns. In addition, it is used to develop models and support decision making (Fan & Bifet, n.d.). Various kinds of software can be used for data mining. For small data sets, simple data mining software may suffice, while highly specialized software may be employed for detailed and extensive tasks that require sifting through tons of information to arrive at finer results (Kapoor, 2014). Through such information collection techniques, companies have made vast amounts of money by employing the internet to acquire business intelligence and thereby improve business decision making (Verma & Nashine, 2012). Before the arrival of data mining techniques, companies had to sift through recorded data sources manually; the bulk of information was often too large, making the process costly, time-consuming, and fraught with errors (Verma & Nashine, 2012). As such, the advent and use of data mining has been a lifeline for many businesses. Data mining technology is, however, in a constant state of flux as new advances are made. Web data mining, for instance, is opening unimagined opportunities for gathering data, along with attendant concerns about data security.
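
To make the pattern-search idea concrete, the following minimal Python sketch performs the kind of automatic search a mining algorithm carries out: a deliberately tiny frequent-itemset count over a hypothetical transaction database. The basket contents and the 60% support threshold are illustrative assumptions, not drawn from the sources cited above.

    from itertools import combinations
    from collections import Counter

    # Hypothetical transaction database: each row is one customer basket.
    transactions = [
        {"bread", "milk"},
        {"bread", "diapers", "beer"},
        {"milk", "diapers", "beer"},
        {"bread", "milk", "diapers", "beer"},
        {"bread", "milk", "diapers"},
    ]

    min_support = 0.6  # an itemset must appear in at least 60% of baskets

    # Count every 1- and 2-item combination across all transactions.
    counts = Counter()
    for basket in transactions:
        for size in (1, 2):
            for itemset in combinations(sorted(basket), size):
                counts[itemset] += 1

    # Keep only the itemsets that meet the support threshold.
    frequent = {
        itemset: count / len(transactions)
        for itemset, count in counts.items()
        if count / len(transactions) >= min_support
    }
    for itemset, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
        print(itemset, round(support, 2))

Real association-rule miners such as Apriori prune the search space instead of enumerating combinations exhaustively, but the output is the same kind of discovered regularity.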

Benefits of Data Mining 

Data mining delivers numerous benefits, especially to governments and businesses. For governments, enhanced information gathering is employed in planning, policy implementation, and evaluation; the era of policy decisions grounded in suspect data appears largely to have passed. Intelligence gathering has also been greatly enhanced. Moreover, terrorism and other high-level security threats are in many ways mitigated through the information gathering that data mining enables, though with strong concerns about privacy (Sharma & Kaur, 2013). For businesses, the benefits have been manifold. Enterprises, regardless of size or industry, are able, in the context of defined business objectives, to automatically explore and understand their data. As such, they can uncover patterns, dependencies, and relationships that impact business outcomes like risk management, profit improvement, and revenue growth (Verma & Nashine, 2012). In this regard, the business intelligence benefits of data mining have a significant descriptive function. Nevertheless, they also play a powerfully important predictive role. Uncovered relationships are expressed in predictive models or business rules, and outputs are communicated in traditional reporting formats to guide planning and strategy (Kumar, Amit, & Tyagi, 2014). Underlying these presentational outputs is programming code that can be deployed into business operating systems to produce predictions of future events based on recently generated data.
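
As a minimal illustration of that predictive role, the Python sketch below trains a model on synthetic historical records and then scores newly generated ones; the features, the choice of scikit-learn, and the response variable are all illustrative assumptions rather than anything prescribed by the sources.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Synthetic history: [feature_1, feature_2] -> responded to mailing (1/0).
    rng = np.random.default_rng(0)
    X_history = rng.normal(size=(500, 2))
    y_history = (X_history[:, 0] + X_history[:, 1] > 0).astype(int)

    # Train the predictive model on historical outcomes.
    model = LogisticRegression(max_iter=1000).fit(X_history, y_history)

    # Deploy: score newly generated records to predict future events.
    X_new = rng.normal(size=(3, 2))
    print(model.predict_proba(X_new)[:, 1])  # probability each client responds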

Moreover, in logistics management, data mining aids in improving the efficiency of decisions, making appropriate sales decisions, reducing inventory costs, and analyzing the market and its trends (Verma & Nashine, 2012). By integrating transport data with inventory data, appropriate cost-saving decisions can be made. In addition, inventory can be increased or reduced based on the available predictions, thereby minimizing the traditional burden of misguided inventory. By mining the hidden value of information, large data sets can aid in making accurate and timely decisions. For product promoters, data mining offers prospects for predicting purchase trends, a model for targeting customers, and real-time, accurate, and dynamic information for promotion (Sharma & Kaur, 2013). The psychology of the customer is also better understood as purchase patterns are analyzed. In brief, the importance of data mining in logistics management cannot be gainsaid.

The Process of Data Mining 

Various stages constitute the process of data mining, the initial phase being the determination of its objective. The entire process is driven by goals set well in advance, and the needs of the organization inform those objectives. Since data mining is an elaborate and wide-ranging undertaking, imprecisely specified mining is likely to yield unclear outcomes. The second step is data preparation, which entails data selection, pre-processing, and data transformation. Notably, the quality of data preparation significantly impacts the validity and efficiency of the mining results. As such, the preparation and transformation of data sets is the critical activity in this stage and will typically consume 60% of the total mining time (Verma & Nashine, 2012). Ultimately, the essential tasks in data preparation are to choose proper inputs and outputs, identify the data, and state the mining goal.
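
The short Python sketch below illustrates what selection, pre-processing, and transformation can look like in practice. The column names, values, and use of pandas are illustrative assumptions, not drawn from the cited sources.

    import pandas as pd

    # Hypothetical raw extract; column names are illustrative only.
    raw = pd.DataFrame({
        "customer_id": [1, 2, 2, 3, 4],
        "revenue": ["100", "250", "250", None, "80"],
        "region": ["north", "south", "south", "NORTH", "east"],
    })

    # Selection: keep only the inputs and outputs relevant to the mining goal.
    df = raw[["customer_id", "revenue", "region"]].drop_duplicates()

    # Pre-processing: fix types, normalize inconsistent categories, drop gaps.
    df["revenue"] = pd.to_numeric(df["revenue"])
    df["region"] = df["region"].str.lower()
    df = df.dropna(subset=["revenue"])

    # Transformation: scale the numeric field so algorithms treat inputs comparably.
    df["revenue_scaled"] = (df["revenue"] - df["revenue"].mean()) / df["revenue"].std()
    print(df)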

The third step is the actual process of data mining. Various methods are employed at this stage, including classification analysis, association analysis, cluster analysis, outlier analysis, sequence analysis, and time sequence analysis, among others (Verma & Nashine, 2012). Essentially, the emphasis is on selecting an effective data mining algorithm. The results of the mining process can be either descriptive or predictive knowledge. Subsequently, analysis, interpretation, and validation of the results are done. In addition, valuable rules and knowledge are identified and expressed in an understandable manner; visualization is one of the best techniques for transforming useful information into a comprehensible format (Akilan, 2015). Lastly, the assimilation of data is undertaken, which entails converting knowledge acquired from the data mining process into value through knowledge implementation in the business cycle (Verma & Nashine, 2012). In brief, therefore, the mining of big data is an elaborate process made of multiple, distinct phases, each of which introduces its own set of challenges, including scale, heterogeneity, timeliness, privacy, and complexity.
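
Of the methods listed above, cluster analysis is perhaps the easiest to show in a few lines. The Python sketch below is a toy example assuming scikit-learn and synthetic two-feature customer data; it recovers two hidden customer segments as descriptive knowledge.

    import numpy as np
    from sklearn.cluster import KMeans

    # Synthetic customer features: [annual_spend, visits_per_month].
    rng = np.random.default_rng(1)
    X = np.vstack([
        rng.normal([10, 2], 1.0, size=(50, 2)),   # low-spend segment
        rng.normal([50, 10], 2.0, size=(50, 2)),  # high-spend segment
    ])

    # Cluster analysis: let the algorithm recover the hidden segments.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(np.bincount(labels))  # descriptive knowledge: segment sizes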

Challenges of Data Mining 

Heterogeneity and Incompleteness 

One of the main difficulties with the analysis of big data is its large scale combined with the presence of mixed data, that is, heterogeneous mixture data arising from the different rules and patterns found within collected and stored data. In complex heterogeneous mixture data, there are several rules and patterns whose properties vary considerably. Furthermore, data may be in structured or unstructured format. About 80% of organization-generated data is unstructured, highly dynamic, and has no particular format; it may take the form of images, email attachments, medical records, X-rays, voice mails, audio, graphics, or PDF documents and, therefore, cannot be stored in the row/column format of structured data (Verma & Nashine, 2012). As such, converting this data into a structured form for analysis is a key hurdle in any major data analysis (Jaseena & Julie, 2014). Ultimately, this creates the need for new technologies to deal with such data.
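
One common route from unstructured text to a row/column representation is term weighting. The Python sketch below, which assumes scikit-learn and three invented email bodies, turns free text into the kind of numeric matrix that standard mining algorithms can consume.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical unstructured records: free-text email bodies.
    emails = [
        "invoice attached for your order",
        "your order has shipped",
        "meeting notes attached",
    ]

    # Convert free text into a structured row/column matrix of term weights.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(emails)

    print(X.shape)                          # rows = documents, columns = terms
    print(vectorizer.get_feature_names_out())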

In addition, incomplete data poses serious uncertainties for analysis and must be managed if a degree of accuracy is to be achieved. Of course, achieving this delicate balance is difficult. The term incomplete data describes the absence of data field values for some samples (Verma & Nashine, 2012). Missing values are attributable to varying realities such as sensor node malfunction or policies deliberately crafted to skip some values (Jaseena & Julie, 2014). While most data mining algorithms incorporate components to address missing values, for instance by ignoring them, data imputation has emerged as an important area of research that seeks to fill in missing values so as to produce improved models rather than relying on those constructed from the original, incomplete data (Kumar, Amit, & Tyagi, 2014). The major imputation approach is to fill gaps with the most commonly observed values; other methods construct learning models that predict a plausible value for each data field based on the values observed in each instance.
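
As a minimal sketch of the simpler approach, the Python fragment below uses scikit-learn's SimpleImputer to fill gaps with a representative observed value (here the column mean); the data values are invented. The learning-based approaches described above would instead fit a predictive model per field.

    import numpy as np
    from sklearn.impute import SimpleImputer

    # Samples with missing field values (np.nan), e.g. from sensor malfunction.
    X = np.array([
        [1.0, 20.0],
        [2.0, np.nan],
        [np.nan, 24.0],
        [4.0, 28.0],
    ])

    # Simple approach: fill each gap with a representative observed value.
    mean_imputer = SimpleImputer(strategy="mean")
    print(mean_imputer.fit_transform(X))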

Scale and Complexity 

Besides the challenge of heterogeneous and incomplete data, rapidly increasing data volumes introduce additional difficulties. Traditional software tools are often inadequate for ever-increasing data volumes (Gupta, Sunny, & Singhal, 2014). Data analysis, organization, modeling, and retrieval pose further challenges because of the complexity and scale of the data demanding analysis.

Timeliness 

As the heterogeneity, incompleteness, and size of the data to be processed become important concerns, the timeliness of data mining becomes a crucial matter. Indeed, this is a pressing issue, especially when situations demand immediate results (Jaseena & Julie, 2014). For instance, if a fraudulent transaction on, say, a credit card is suspected, it ought to be flagged before the transaction completes, thereby preventing the illegal transaction from taking place. Certainly, it is not feasible to perform a real-time, full analysis of the user's history. Therefore, partial results have to be computed well in advance so that incremental computation with new data can be employed to reach a quick determination (Verma & Nashine, 2012). When working with large data sets, it has also become increasingly important to discover patterns within them that meet particular criteria (Fan & Bifet, n.d.). In the process of data analysis, this kind of search becomes common. Since scanning a whole data set to find suitable elements is impractical, index structures are created to allow qualifying elements to be found quickly (Jaseena & Julie, 2014). Challenges, however, remain, since every index structure supports only specific classes of criteria.
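
The pattern of pre-computing partial results and updating them incrementally can be sketched with an online learner. The Python fragment below is a toy example assuming scikit-learn's SGDClassifier and synthetic transaction features: the model is fitted offline, then each arriving transaction is scored before completion and later folded in as an incremental update rather than triggering a re-analysis of the whole history.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(2)
    classes = np.array([0, 1])  # 0 = legitimate, 1 = fraudulent
    model = SGDClassifier(loss="log_loss")

    # Offline: partial results computed in advance from historical transactions.
    X_hist = rng.normal(size=(1000, 3))
    y_hist = (X_hist[:, 0] > 1.5).astype(int)
    model.partial_fit(X_hist, y_hist, classes=classes)

    # Online: score each arriving transaction before it completes.
    x_new = rng.normal(size=(1, 3))
    print("flagged" if model.predict(x_new)[0] else "approved")

    # Later, fold in the confirmed outcome as an incremental update.
    y_confirmed = np.array([0])  # label arrives after investigation
    model.partial_fit(x_new, y_confirmed)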

Security and Privacy Challenges 

As data processing applications have grown in capacity, data sources have expanded dramatically. Big data (datasets beyond the ability of commonly used software such as traditional data processing applications or database management tools) has grown steadily, from a few dozen terabytes in a single data set in 2012 to many petabytes in 2018 (Jaseena & Julie, 2014). Big data creates enormous opportunities for the economy, not only in security but also in areas like credit risk analysis, marketing, and research (Gupta, Sunny, & Singhal, 2014). These extraordinary benefits are, however, significantly lessened by real concerns over data protection and privacy.

Further, as big data grows in size, its trustworthiness becomes suspect. As such, techniques have to be developed periodically to explore large data sets and identify maliciously inserted information. Information security is today considered the most pressing problem for big data analytics, where large amounts of data have to be analyzed, correlated, and mined to establish meaningful patterns (Verma & Nashine, 2012). Security threats may take multiple forms, such as an unauthorized person eavesdropping on data packets sent to a client or gaining access privileges. In the latter case, the unauthorized user can, for instance, submit information to the system or read and write data blocks of a file (Jaseena & Julie, 2014). Given the potentially far-reaching impacts of security breaches, data mining organizations have enhanced their authorization, authentication, encryption, and audit trails. While security breaches remain a looming possibility, these methods have to be continually improved. They include authentication measures, file encryption with key management, logging, and secure communication.

Notably, authentication techniques entail verifying a system's or user's identity before granting access. Additionally, file encryption plays a significant role in securing sensitive data and preserving the privacy and confidentiality of user information. Indeed, encryption protects data should malicious administrators or users gain access and seek to inspect files, and it renders copied or stolen files or images unreadable (Kumar, Amit, & Tyagi, 2014). Encryption often works as an effective method of protecting data since consistent protection is offered across different platforms regardless of platform or OS type (Jaseena & Julie, 2014). Nonetheless, the threat remains that expert attackers may find their way around encryption.
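
As a small illustration, the Python sketch below encrypts a record with the Fernet recipe from the third-party cryptography package (symmetric, authenticated encryption). The record contents are invented, and in a real deployment the key would be held by a key-management service rather than generated alongside the data.

    from cryptography.fernet import Fernet

    # Hypothetical sensitive record to be stored in a mined dataset.
    record = b"patient_id=4711, diagnosis=..."

    key = Fernet.generate_key()  # in practice, held by a key-management service
    cipher = Fernet(key)

    token = cipher.encrypt(record)  # what a thief sees: unreadable bytes
    print(cipher.decrypt(token))    # only key holders recover the plaintext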

Also, the implementation of access control entails specifying access privileges for systems or users to enhance security. Logging is used to detect attacks, investigate unusual behavior, and diagnose failures. Since many web companies' first encounter with big data is in managing log files, logging offers a natural place to look when things fail or a hack is suspected. Auditing of entire systems is also significantly enhanced by logging, since individual activity can be traced (Kapoor, 2014). Secure communications between nodes, and between nodes and applications, have been implemented in many companies, most notably by government security agencies. Even though these systems have been effective in many instances, there are occasions of failure or breach leading to major security concerns (Naughton, 2017). Such breaches have led to major public fear and outcry about privacy and the inappropriate use of personal data.

Consequently, two dominant approaches have been adopted to protect the privacy of users. The first strategy is to restrict data access by adding certification or controlling access so that sensitive data remains accessible only to a limited number of authorized users. The other approach is to anonymize data fields so that sensitive data cannot be traced back to an individual record (Jaseena & Julie, 2014). For both restricted access and data anonymization, risks remain to be overcome, meaning neither is 100% tamper-proof. Despite these genuine concerns, there is no doubt that the ability to mine big data is a major trend of corporate and national importance that is set to produce more breakthroughs in the future. Indeed, as researchers develop novel insights and essential applications, these breakthroughs are beginning to take shape.
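
A bare-bones version of the second approach is field pseudonymization: replacing identifiers with salted one-way hashes. The Python sketch below is illustrative only (the field names and values are invented) and, as noted above, not tamper-proof; low-entropy fields such as ZIP codes can still be linked by a determined attacker.

    import hashlib
    import secrets

    # Salt kept separate from the data; without it the mapping is hard to reverse.
    salt = secrets.token_bytes(16)

    def pseudonymize(value: str) -> str:
        """Replace an identifying field with a salted one-way hash."""
        return hashlib.sha256(salt + value.encode()).hexdigest()[:16]

    record = {"name": "Jane Doe", "zip": "90210", "purchase": "laptop"}
    record["name"] = pseudonymize(record["name"])  # sensitive field anonymized
    print(record)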

Future Directions 

Transfer Learning Challenge 

Data mining can bring breakthroughs in transfer learning. Transfer learning aims to extract knowledge from one or more auxiliary tasks to help discover key patterns in a different though related target task (Yang, 2010). In transfer learning, data from the auxiliary or source domains and data in the target domains may follow different distributions or be represented with different features. This problem is of interest because it runs against the core assumption of traditional data mining and machine learning that all data comes from the same space. The problem is especially relevant in novel data mining domains where there is limited annotated or labeled data for building a credible model, and where technical or budget constraints limit the ability to obtain new, high-quality labels. It is this difficult situation that forces people to look elsewhere for auxiliary sources of data close to the target domain (Yang, 2010). In brief, how to mine useful knowledge from auxiliary data sources, even seemingly different ones, is a critical challenge for data mining today, and its resolution could lead to effective transfer learning.

Additionally, transfer learning can, for instance, be employed in wireless indoor location estimation. The past few years have seen a proliferation of wireless technology, making geolocation data such as WiFi and GPS signals widely available. The ability to exploit WiFi data for localization has numerous advantages over other sensors, since WiFi is pervasive and cheap, being available in both indoor and outdoor settings. If successful WiFi location prediction models can be built, it becomes possible to create higher-level applications ranging from logistics monitoring to health care. The machine learning approach to indoor localization assumes that learning occurs in two phases, offline and online (Yang, 2010). In the offline phase, labeled training data is used to train a localization model; in the online phase, that model is used to locate a mobile device in real time. Consequently, this two-phase flow assumes that the data distribution is stationary. This may not hold in numerous real-world situations, for instance because the data distribution is a function of time, space, or the client device. Models trained for one kind of device (a Cisco Aironet, say) might become invalid on another device (a Huawei E5830) (Yang, 2010). Transfer learning helps eliminate such recalibration problems through multi-task learning, in which every task is treated as both a source and a target task. The same techniques can be used in other areas where labeling is expensive and time-consuming, such as bioinformatics.
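
A minimal sketch of the idea, under strong simplifying assumptions (synthetic signal-strength data, a linear relationship, and scikit-learn's SGDRegressor), is to fit on plentiful source-device data and then adapt the same parameters with the few labels available from a new device, rather than recalibrating from scratch. This is parameter transfer in miniature, not Yang's full multi-task method.

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    rng = np.random.default_rng(3)

    # Source domain: plentiful labeled signal strengths from one device type.
    X_src = rng.normal(size=(2000, 5))  # readings from 5 access points
    y_src = X_src @ np.array([2.0, -1.0, 0.5, 0.0, 1.0])  # known 1-D positions

    # Target domain: same task, shifted distribution (new device), few labels.
    X_tgt = X_src[:40] * 1.3 + 0.2
    y_tgt = y_src[:40]

    # Transfer: start from the source model instead of learning from scratch.
    model = SGDRegressor(max_iter=1000, tol=1e-3).fit(X_src, y_src)
    for _ in range(20):
        model.partial_fit(X_tgt, y_tgt)  # adapt with the sparse target labels
    print(model.predict(X_tgt[:2]))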

Data mining also promises breakthroughs in social learning challenges. Social media and social networks are transforming the computational and social landscape, with services like Twitter, Flickr, and Facebook allowing millions of users to go online and share information (Paidi, 2012). Data mining can enable individuals to discover the underlying dynamics of this data as it evolves, and it can permit accurate predictions about a user's connections to other users, communities, or products even from individually sparse data (Yang, 2010). Importantly, data mining can allow the mixing of crowd, human, and computing power to create more robust computational models of user behavior.

Another fascinating area of data mining attracting considerable attention is collective or distributed data mining. Much of the data mining currently taking place focuses on a data warehouse or database of information that is physically in one location (Paidi, 2012). Situations, however, arise where information resides in different physical locations, which is the province of distributed data mining. The goal is to develop techniques for effectively mining data found in these heterogeneous locations (Paidi, 2012). Examples include biological information held in different databases, data that comes from different firms, and analytical data from different departments of a firm; physically combining such data would be time-consuming and expensive. To this end, distributed data mining offers an alternative to traditional analysis by combining a global data model with localized analysis, as the sketch below illustrates.
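
One simple realization of "localized analysis plus a global model" is to fit a model at each site and combine only the models' outputs, so the raw data never leaves its location. The Python sketch below assumes scikit-learn and invented per-site data; real distributed mining systems use far more sophisticated aggregation schemes.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(4)

    def make_site_data(n):
        """Stand-in for data held at one physical site."""
        X = rng.normal(size=(n, 3))
        y = (X.sum(axis=1) > 0).astype(int)
        return X, y

    # Localized analysis: each site fits its own model; raw data never moves.
    site_models = [LogisticRegression().fit(*make_site_data(300)) for _ in range(3)]

    # Global model: combine the sites' outputs, e.g. by averaging probabilities.
    X_query = rng.normal(size=(5, 3))
    avg = np.mean([m.predict_proba(X_query)[:, 1] for m in site_models], axis=0)
    print((avg > 0.5).astype(int))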

Ubiquitous data mining (UDM) is another area likely to receive much attention. The advent of palmtops, laptops, wearable computers, and cell phones is revolutionizing access to big data (Fan & Bifet, n.d.). Advanced analysis of big data is thus the next natural step in this world. However, accessing and interpreting data from ubiquitous computing devices is fraught with challenges. For instance, UDM introduces additional costs due to computation, communication, and security. A key objective of UDM, therefore, will have to be mining data while lowering the costs linked to ubiquitous presence. Another significantly challenging aspect of UDM is human-computer interaction (Paidi, 2012). Additionally, the psychological and sociological aspects of integrating data mining technology with everyday lifestyles are yet to be explored.

Conclusion 

Data mining is one of the most consequential technological developments of our time, and its potential contribution to business and national well-being is already evident. This paper has explored the concept of data mining in popular and technical parlance and subsequently addressed data mining trends and applications. In addition, it has confronted existing challenges and explored possible solutions, ending with an examination of likely future directions in data mining that promise breakthroughs.

References 

Akilan, A. (2015). Text mining: Challenges and future directions. ResearchGate. https://doi.org/10.1109/ECS.2015.7124872

Fan, W., & Bifet, A. (n.d.). Mining big data: Current status, and forecast to the future. SIGKDD Explorations, 14(2).

Gupta, R., Sunny, G., & Singhal, A. (2014). Big data: Overview. IJCTT, 9(5).

Jaseena, K. U., & Julie, D. (2014). Issues, challenges and solutions: Big data mining. Conference paper.

Kapoor, A. (2014). Data mining: Past, present and future scenario. International Journal of Emerging Trends & Technology in Computer Science (IJETTCS), 3(1), 95-99.

Kumar, A., Amit, K. T., & Tyagi, S. K. (2014). Data mining: Various issues and challenges for future. International Journal of Emerging Technology and Advanced Engineering, 4(1).

McFarland, M. (2014). The incredible potential and dangers of data mining health records. The Washington Post. https://www.washingtonpost.com/news/innovations/wp/2014/10/01/the-incredible-potential-and-dangers-of-data-mining-health-records/?utm_term=.4f89f88494da

Naughton, J. (2017). Why Facebook is in a hole over data mining. The Guardian. https://www.theguardian.com/commentisfree/2017/oct/08/facebook-zuckerberg-in-a-hole-data-mining-business-model

Ostherr, K. (2018). Facebook knows a ton about your health. Now they want to make money off it. The Washington Post. https://www.washingtonpost.com/news/posteverything/wp/2018/04/18/facebook-knows-a-ton-about-your-health-now-they-want-to-make-money-off-it/?utm_term=.06dd1d41d784

Paidi, A. N. (2012). Data mining: Future trends and applications. IJMER, 4657-4663.

Sharma, B. R., & Kaur, D. (2013). A review on data mining: Its challenges, issues and applications. International Journal of Current Engineering and Technology, 695-700.

Verma, D., & Nashine, R. (2012). Data mining: Next generation challenges and future directions. International Journal of Modeling and Optimization, 2(5), 603-609.

Yang, Q. (2010). Three challenges in data mining. Front. Comput. Sci., 4(3), 324-333.
