Different Data Mining Approaches Based Medical Text Data

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Associated Data

No data were used to support this study.

Abstract

The amount of medical text data is increasing dramatically. Medical text data record the progress of medicine and contain a large amount of medical knowledge. As natural language, they are semistructured, high-dimensional, large in volume, and semantically rich, and they cannot be used directly in arithmetic operations. Extracting useful knowledge or information from the available data is therefore an important task, and data mining techniques can extract such knowledge. In the current study, we reviewed different approaches to medical text data mining. The advantages and shortcomings of each technique were analyzed with respect to the different stages of processing medical text data. We also explored applications of the algorithms, to give users insight and help them apply these resources to specific challenges in medical text data. Further, the main challenges in medical text data mining were discussed. The findings of this paper can help researchers choose reasonable techniques for mining medical text data and make them aware of the main challenges in the field.

1. Introduction

The era of big data has arrived, with the mass of data growing at an incredible rate. The concept of big data was first put forward at the 11th EMC World conference in 2011 and refers to large-scale datasets that cannot be captured, managed, or processed by common software tools. With the arrival of the big data age, the amount of medical text data is increasing dramatically. Analyzing this immense amount of medical text data to extract valuable knowledge or information is useful for decision support, prevention, diagnosis, and treatment in the medical world [1]. However, analyzing a huge amount of multidimensional or raw data is a very complicated and time-consuming task. Data mining is well suited to it.

Data mining is a methodology for discovering novel, valuable, and useful information, knowledge, or hidden patterns in enormous datasets by using various statistical approaches. Compared with the traditional model of transforming data into knowledge through manual analysis and interpretation, data mining has many advantages: it is quicker, less labor-intensive, and more objective. Summarizing the various data mining approaches applied to medical text data for clinical applications is essential for health management and medical research.

This paper is organized in four sections. Section 2 presents the concepts of medical text data. Section 3 covers data mining approaches and their applications in medical text data analysis. Section 4 concludes this paper and presents future work.

2. Medical Text Data

The diversity of big data is inseparable from the abundance of data sources. Medical big data, including experimental data, clinical data, and medical imaging data, are increasing with the rapid development of medicine. Medical big data are the application of big data in the medical field after the data related to human health and medicine have been stored, searched, shared, analyzed, and presented in innovative ways [2]. Medical text data are an important part of medical big data. They are described in natural language, cannot be used directly in arithmetic operations, and are semistructured, high-dimensional, large in volume, and semantically rich [3]. They are hard to use in research because they have no fixed writing format and are highly specialized [4]. Medical text data include clinical data, medical record data, medical literature data, and so on; this type of data records the progress of medicine and contains a large amount of medical knowledge. However, manually extracting the relationships between entities from a vast amount of medical text is very time-consuming. With the development of data mining technology, its use for discovering relationships in medical text has become a hot topic. Medical text data mining can assist the discovery of medical information. In COVID-19 research, medical text mining can help decision-makers control the outbreak by gathering and collating basic scientific data and research literature related to the novel coronavirus and by predicting the populations susceptible to COVID-19, virus variability, and potential therapeutic drugs [5–8].

3. Medical Text Data Mining

Data mining was defined at the first International Conference on Knowledge Discovery and Data Mining in 1995 and has since been widely used in auxiliary disease diagnosis, drug development, hospital information systems, and genetic medicine to facilitate medical knowledge discovery [9–12]. Data mining of medical text data can be divided into four steps: data collection, data processing, data analysis, and data evaluation and interpretation. This study summarizes the algorithms and tools for medical text data according to these four steps.

3.1. Data Preparation

Medical text data include electronic medical records, medical images, medical record parameters, laboratory results, and pharmaceutical antiquities according to the different data sources. The different data were selected based on the data mining task and stored in the database for further processing.

3.2. Data Processing

The quality of data affects the efficiency and accuracy of data mining and the effectiveness of the final pattern. Raw medical text data contain a large amount of fuzzy, incomplete, noisy, and redundant information. Taking medical records as an example, traditional paper-based medical records have many shortcomings, such as nonstandard terms, poor support for clinical decision-making, and scattered information. Since the emergence of electronic medical records, medical record data have gradually been standardized [13]. However, electronic medical records are still written in natural language and remain difficult to mine. Therefore, it is necessary to clean and filter the data, removing missing, incorrect, noisy, inconsistent, or low-quality entries, to ensure data consistency and certainty.

Missing values in medical text data are usually handled by deletion or interpolation. Deletion is the easiest method, but some useful information is lost. Interpolation assigns reasonable substitute values to missing values through a specific algorithm. Many algorithms have emerged for this stage of data processing. Multiple imputation, regression algorithms, and K-nearest neighbors are often used to fill in missing values in medical text data. Detailed information on the algorithms is shown in Table 1. To further understand the semantic relationships in medical texts, researchers have used natural language processing (NLP) techniques to perform named entity recognition, relationship extraction, and text classification on medical text data, with good results [19].

Table 1

The detailed algorithm information for missing values in medical text data.

| Algorithm | Principle | Purpose |
|---|---|---|
| Multiple imputation [14, 15] | Estimate the value to be interpolated, and add different noises to form multiple groups of optional interpolation values; select the most appropriate interpolation value according to a certain selection basis. | Repeat the simulation to supplement the missing value |
| Expectation maximization [16] | Compute maximum likelihood estimates or posterior distributions with incomplete data. | Supplement missing values |
| K-nearest neighbors [17, 18] | Select its K closest neighbors according to a distance metric and estimate missing data with the corresponding mode or mean. | Estimate missing values with samples |
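
As an illustration of the K-nearest neighbors row in Table 1, the following minimal Python sketch fills a missing numeric value with the mean of the same feature in the K closest records. It is a simplified sketch rather than the method of any cited study: the function name `knn_impute`, the toy records (temperature, heart rate, systolic pressure), and the choice of an unscaled Euclidean distance over the features both records share are all assumptions made for the example.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature in the k nearest rows."""

    def distance(a, b):
        # Root-mean-square difference over features observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows in which feature j is observed.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Hypothetical records: [temperature, heart rate, systolic pressure]
records = [
    [36.6, 72.0, 118.0],
    [36.8, 75.0, 121.0],
    [39.1, 110.0, None],   # systolic pressure is missing
    [39.0, 108.0, 142.0],
    [38.9, 105.0, 139.0],
]
filled = knn_impute(records, k=2)
print(filled[2][2])  # mean of the two most similar records: 140.5
```

In practice the features would normally be scaled before distances are computed, and categorical values would be filled with the mode rather than the mean, as the table notes.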

3.2.1. Natural Language Processing

Natural language processing (NLP), a subfield of artificial intelligence, is mainly used for Chinese word segmentation, part-of-speech tagging, parsing, natural language generation, text categorization, information retrieval, information extraction, text proofing, question answering, machine translation, automatic summarization, and textual entailment [20].

In medical text processing, NLP is often used for information extraction and named entity recognition, which involve word segmentation, sentence segmentation, syntactic analysis, grammatical analysis, and pragmatic analysis. The schematic of natural language processing is shown in Figure 1. Kou et al. [21] used NLP tools to extract important disease-related concepts from clinical notes, forming a multichannel processing method and improving data extraction ability. Jonnagaddala et al. [22] proposed a hybrid NLP model to identify Framingham heart failure signs and symptoms from clinical notes and electronic health records (EHRs). Trivedi et al. [23] designed an interactive NLP tool to extract information from clinical texts, which was evaluated and found to serve clinicians well. Datta et al. [24] evaluated NLP technology for extracting cancer information from EHRs, summarized the functions implemented by each framework, and found that different NLP frameworks contain many repeated parts, which wastes resources. The growing diversity of medical text data will also transform the modes of medical data analysis and decision support. Roberts and Demner-Fushman [25] manually annotated 468 electronic medical records to generate a corpus that supports medical data mining. The development of NLP technology greatly reduces the difficulty of manual data processing in data mining. Vashishth et al. [26] used semantic type filtering to improve the linking of medical entities across all toolkits and datasets, providing a new semantic type prediction module for the biomedical NLP pipeline. Topaz et al. [27] used an NLP-based classification system, support vector machine (SVM), recurrent neural network (RNN), and other machine learning methods to identify diabetic patients from clinical records and reduce the manual workload in medical text data mining.

Figure 1

Schematic of natural language processing flow.

3.3. Data Analysis

Data analysis applies data mining methods to extract interesting patterns. Establishing a model is essential for knowledge discovery in data analysis. Modeling and analysis are performed according to the characteristics of the data. After the initial test, the model parameters are adjusted. The advantages and disadvantages of different models are compared to choose the final optimal model. Depending on the goal, data analysis methods for medical text data include clustering, classification, association rules, and regression. Detailed information on these methods is shown in Table 2.

Table 2

The information of analysis methods for medical text data.

| Methods | Purpose | Algorithms | Advantages | Shortcomings |
|---|---|---|---|---|
| Clustering | Classify similar subjects in medical texts | K-means [28, 29] | 1. Simple and fast; 2. Scalability and efficiency | 1. Large amount of data and time-consuming; 2. More restrictions on use |
| Classification | Read medical text data for intention recognition | ANN [30, 31] | 1. Solve complex mechanisms in text data; 2. High degree of self-learning; 3. Strong fault tolerance | 1. Slow training; 2. Many parameters and difficulty in adjusting parameters |
| Classification | Read medical text data for intention recognition | Decision tree [32, 33] | 1. Handle continuous variables and missing values; 2. Judge the importance of features | 1. Overfitting; 2. The result is unstable |
| Classification | Read medical text data for intention recognition | Naive Bayes [34] | 1. The learning process is easy; 2. Good classification performance | Higher requirements for data independence |
| Association rules | Mine frequent items and corresponding association rules from massive medical text datasets | Apriori [35, 36] | Simple and easy to implement | Low efficiency and time-consuming |
| Association rules | Mine frequent items and corresponding association rules from massive medical text datasets | FP-tree [37] | 1. Reduce the number of database scans; 2. Reduce the amount of memory space | High memory overhead |
| Association rules | Mine frequent items and corresponding association rules from massive medical text datasets | FP-growth [38] | 1. Improve data density structure; 2. Avoid repeated scanning | Harder to achieve |
| Logistic regression | Analyze how variables affect results | Logistic regression [39] | 1. Visual understanding and interpretation; 2. Very sensitive to outliers | 1. Easy underfitting; 2. Cannot handle a large number of multiclass features or variables |

3.3.1. Artificial Neural Network

An artificial neural network (ANN) is a nonlinear prediction model that is learned by training. In data mining, it has the advantages of accurate classification, self-learning, associative memory, fast search for the optimal solution, and good stability. An ANN mainly consists of three parts: an input layer, a hidden layer, and an output layer [40]. The input layer receives external information and data. The hidden layer processes the information and constantly adjusts the connection properties between neurons, such as weights and feedback, while the output layer outputs the calculated results. ANN differs from traditional artificial intelligence and information processing technology: it overcomes the drawbacks of traditional, logic-symbol-based artificial intelligence in processing intuitive and unstructured information, and it is self-adaptive, self-organizing, and capable of real-time learning. It can perform data classification, feature mining, and other mining tasks. Medical text data contain massive amounts of patient health records, vital signs, and other data. ANN can analyze patients' rehabilitation, find regularities in patient data, predict a patient's condition or recovery, and help to discover medical knowledge [41].

There are several ANN mining techniques that are used for medical text data, such as backpropagation and factorization machine-supported neural network (FNN). The information on ANN mining techniques is shown in Table 3 .

Table 3

The information of ANN mining techniques.

| ANN mining techniques | Advantages | Shortcomings |
|---|---|---|
| Backpropagation [42] | 1. Strong nonlinear mapping capability; 2. Strong generalization ability; 3. Strong fault tolerance | 1. Local minimization; 2. Slow convergence; 3. Different structure choices |
| Radial basis function [43] | 1. Fast learning speed; 2. Easy to solve text data classification problems | Complex structure |
| FNN [44] | 1. Reduce feature engineering; 2. Improve FM learning ability | Limited modeling capability |

(1) ANN Core Algorithm: BP Algorithm. The backpropagation (BP) algorithm, the classical algorithm of the ANN, is widely used for medical text data. The BP algorithm was developed on the basis of the single-layer neural network. It propagates errors backward to adjust the weights of a multilayer network, so that the system can continue to learn. A BP network is a multilayer feed-forward network: signals propagate forward, while errors propagate backward. Compared with recurrent neural network algorithms, the backward propagation of errors makes it faster and more powerful for modeling high-throughput microarray or sequencing data [45].

Training with the BP algorithm is mainly divided into the following two stages:

Forward propagation process: starting from the input layer, the actual output value of each computing unit is calculated layer by layer through the hidden layers

Backpropagation process: when the output value does not reach the expected value, the difference between the actual output and the expected output is calculated recursively, and the weights are adjusted according to the difference. The total error is defined as

$$E = \sum_{k=1}^{m} E_k = \sum_{k=1}^{m} \sum_{t=1}^{q} \frac{\left(y_t^k - c_t^k\right)^2}{2}.$$

Here m is the total number of samples, k is the sample index, t is the index of the output unit, q is the number of output units, $y_t^k$ is the desired output, and $c_t^k$ is the actual output.

In clinics, the judgment of disease is often determined by the integration of multidimensional data. In the establishment of disease prediction models, BP algorithms can not only effectively classify complex data but also have good multifunctional mapping. The relationship between data and disease can be found in the process of repeated iteration [46].
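
The two training stages and the total error E can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any cited study: the network has one hidden layer with sigmoid units, weights are updated by batch gradient descent on E, and the toy data (two binary findings mapped to one binary outcome), the learning rate, and the function names are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hid, w_out):
    """Forward propagation; the last weight of each unit acts as its bias."""
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, x + [1.0]))) for ws in w_hid]
    output = [sigmoid(sum(w * v for w, v in zip(ws, hidden + [1.0]))) for ws in w_out]
    return hidden, output


def total_error(samples, w_hid, w_out):
    """E = sum over samples k and output units t of (y_t^k - c_t^k)^2 / 2."""
    return sum(
        (y_t - c_t) ** 2 / 2.0
        for x, y in samples
        for y_t, c_t in zip(y, forward(x, w_hid, w_out)[1])
    )


def train(samples, n_hidden=3, lr=0.2, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(samples[0][0]), len(samples[0][1])
    w_hid = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [[rng.uniform(-1, 1) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    history = [total_error(samples, w_hid, w_out)]
    for _ in range(epochs):
        g_hid = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        g_out = [[0.0] * (n_hidden + 1) for _ in range(n_out)]
        for x, y in samples:
            hidden, c = forward(x, w_hid, w_out)
            # Backpropagation: output-layer deltas first, then hidden-layer deltas.
            d_out = [(c_t - y_t) * c_t * (1 - c_t) for c_t, y_t in zip(c, y)]
            d_hid = [
                h * (1 - h) * sum(d_out[t] * w_out[t][j] for t in range(n_out))
                for j, h in enumerate(hidden)
            ]
            for t in range(n_out):
                for j, v in enumerate(hidden + [1.0]):
                    g_out[t][j] += d_out[t] * v
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    g_hid[j][i] += d_hid[j] * v
        # Adjust the weights against the accumulated error gradient.
        for t in range(n_out):
            for j in range(n_hidden + 1):
                w_out[t][j] -= lr * g_out[t][j]
        for j in range(n_hidden):
            for i in range(n_in + 1):
                w_hid[j][i] -= lr * g_hid[j][i]
        history.append(total_error(samples, w_hid, w_out))
    return w_hid, w_out, history


# Hypothetical toy data: two binary findings -> one binary outcome.
samples = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
           ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
w_hid, w_out, history = train(samples)
print(f"total error: {history[0]:.4f} -> {history[-1]:.4f}")
```

The `history` list records E after every pass over the data, which makes the repeated iteration described above visible: the error shrinks as the weights are adjusted.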

(2) Application Examples. Adaptive learning based on ANN can find regularities in massive medical text data and assist the discovery of medical knowledge. Heckerling et al. [47] combined a neural network and a genetic algorithm to predict the prognosis of patients with urinary tract infections (as shown in Figure 2). In this study, nine indexes (e.g., frequent micturition and dysuria) from 212 women with urinary tract infections were used as predictor variables for training. The relationship between the symptom and urinalysis input data and the urine culture output data was determined using ANN. The predicted results were accurate.

Figure 2

ANN algorithm analysis process.

Miotto et al. [48] derived a general-purpose patient representation from aggregated EHRs based on ANN that facilitates clinical predictive modeling given the patient status. Armstrong et al. [49] used ANN to analyze 240 microcalcifications in 220 cases of mammography. Data mining results can accurately predict whether the microcalcification in the early stage of suspected breast cancer is benign or malignant.

3.3.2. Naive Bayes

Naive Bayes (NB) is a classification method based on Bayes' theorem [50]. The NB classification algorithm rests on two independence assumptions: attribute values are independent of each other, and positions are independent of each other [51]. Independence of attribute values means that there is no dependence between terms. Position independence means that the position of a term in the document has no effect on the calculation of probability. However, conditional dependence does exist among terms in medical texts, and the location of a term in a document contributes differently to classification [52]. These two independence assumptions can therefore weaken NB estimation. Nevertheless, NB has been widely used on medical texts because it plays an effective role in classification decision-making.
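
The independence assumptions translate directly into code: the score of a class is its prior multiplied by the probability of each term, with term order ignored. The sketch below is a minimal multinomial NB text classifier written for illustration only. The toy documents and labels are hypothetical, and add-one (Laplace) smoothing is an assumption introduced here so that unseen terms do not produce zero probabilities.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (tokens, label). Returns priors, per-class term counts, vocabulary."""
    class_docs = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        term_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(docs) for c, n in class_docs.items()}
    return priors, term_counts, vocab


def predict_nb(tokens, priors, term_counts, vocab):
    """Return the class with the highest log posterior score."""
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        total = sum(term_counts[label].values())
        score = math.log(prior)
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore terms never seen in training
            # Terms are treated as independent, and their position is ignored.
            # Add-one smoothing is an assumption of this sketch.
            score += math.log((term_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus of tokenized notes.
docs = [
    ("fever cough fatigue".split(), "respiratory"),
    ("cough wheeze".split(), "respiratory"),
    ("chest pain palpitations".split(), "cardiac"),
    ("palpitations dizziness".split(), "cardiac"),
]
model = train_nb(docs)
print(predict_nb("cough fever".split(), *model))         # respiratory
print(predict_nb("palpitations pain".split(), *model))   # cardiac
```

Working with log probabilities avoids numerical underflow when documents contain many terms.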

(1) Core Algorithm: NBC4D. The Naive Bayes classifier for continuous variables using a novel method (NBC4D) is a new algorithm based on NB. It handles continuous variables within the Naive Bayes framework, replaces the traditional distribution technique with alternative distribution techniques, and improves classification accuracy by selecting appropriate distribution techniques [53]. The implementation of the NBC4D algorithm is mainly divided into five steps:

Gaussian distribution:

$$f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}$$

Exponential distribution:

$$f(x) = \frac{1}{\alpha} e^{-x/\theta}$$

Kernel density estimation:

$$f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

Rayleigh distribution:

$$f(x; \alpha, \theta) = \frac{x}{\alpha} e^{-x^2 / (2\theta)}$$

NBC4D method: find the product of the probability (likelihood) of each attribute given a specific class and the probability of that class to improve the accuracy

Here x is the input value, μ is the mean, σ² is the variance, α is the parameter that represents the average value (μ), θ represents the standard deviation (σ), K is a Gaussian kernel function, h is the smoothing parameter, n is the number of samples, and $x_i$ is the i-th sample.
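
To make the role of the density functions concrete, the following Python sketch scores a class as the product of its prior and one density value per continuous attribute, using either the Gaussian density or a Gaussian-kernel density estimate from the list above. It is only a sketch of the general idea and is not the NBC4D implementation: the function names, the toy vital-sign data, the bandwidth h = 1.0, and the use of the population standard deviation are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def gaussian_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def kde_pdf(x, samples, h):
    """Kernel density estimate with a standard Gaussian kernel K and bandwidth h."""
    n = len(samples)
    return sum(gaussian_pdf((x - xi) / h, 0.0, 1.0) for xi in samples) / (n * h)


def class_score(x_vec, class_rows, prior, density="gaussian", h=1.0):
    """Product of the class prior and the per-attribute densities."""
    score = prior
    for j, x in enumerate(x_vec):
        column = [row[j] for row in class_rows]
        if density == "gaussian":
            score *= gaussian_pdf(x, mean(column), pstdev(column))
        else:
            score *= kde_pdf(x, column, h)
    return score


def predict(x_vec, data, density="gaussian"):
    """data maps each class label to its list of training rows."""
    total = sum(len(rows) for rows in data.values())
    scores = {
        label: class_score(x_vec, rows, len(rows) / total, density)
        for label, rows in data.items()
    }
    return max(scores, key=scores.get)


# Hypothetical training rows: [temperature, heart rate]
data = {
    "healthy": [[36.5, 70.0], [36.8, 75.0], [36.6, 72.0]],
    "febrile": [[38.9, 105.0], [39.2, 110.0], [38.7, 100.0]],
}
print(predict([39.0, 104.0], data, density="gaussian"))  # febrile
print(predict([39.0, 104.0], data, density="kde"))       # febrile
```

Choosing the density per attribute, rather than assuming a Gaussian everywhere, is the design idea the text attributes to NBC4D; here the choice is simply passed in as a parameter.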

(2) Application Examples. Ehsani-Moghaddam et al. [54] used electronic medical records (EMRs) extracted from the Canadian Primary Care Sentinel Surveillance Network, applied the Naive Bayes algorithm to classify disease features, and found that the Naive Bayes classifier was an effective algorithm to help physicians diagnose Hunter syndrome and optimize patient management (as shown in Figure 3). To predict angiographic outcomes, Golpour et al. [55] used the NB algorithm to process hospital medical records and assessment scales and found that the NB model with three variables had the best performance and could support physician decision-making well.

Figure 3

NB algorithm analysis process.

3.3.3. Decision Tree

The decision tree is a tree structure in which each nonleaf node represents a test on a feature attribute, each branch represents the output of the feature attribute on a certain value domain, and each leaf node stores a category [56]. To make a decision with a decision tree, one starts from the root node, tests the corresponding feature attribute of the item to be classified, follows the output branch that matches its value until a leaf node is reached, and finally takes the category stored in the leaf node as the decision result [57]. The advantages of decision tree learning algorithms include good interpretability, the ability to process various types of data (categorical and numerical), white-box modeling, robustness to noise, and the ability to process large datasets. Medical text data are complex [58]. For instance, electronic medical record data include not only disease characteristics but also patient age, gender, and other characteristic data. Because the construction of a decision tree starts from a single node and the training data set is divided into several subsets according to the attributes of the decision node, the decision tree algorithm can deal with different data types and general attributes at the same time, which is an advantage when processing complex medical text data [59]. The construction of a decision tree is mainly divided into two steps: classification attribute selection and tree pruning. The common algorithm is C4.5 [60].

(1) Core Algorithm: C4.5. Several decision tree algorithms have been proposed, such as ID3 and C4.5. The well-known ID3 algorithm, proposed by Quinlan in 1986, has the advantages of clear theory, a simple method, and strong learning ability. Its disadvantages are that it is only effective for small datasets and is sensitive to noise. When the training data set grows, the decision tree may change accordingly. When selecting test attributes, ID3 tends to select attributes with more values. In 1993, Quinlan proposed the C4.5 algorithm based on the ID3 algorithm [61]. Compared with ID3, C4.5 overcomes the bias toward attributes with more values during attribute selection, prunes the tree during construction, and can process incomplete data. It uses the gain ratio as the selection criterion for the attribute at each node of the decision tree [62]. In particular, its extension, called S-C4.5-SMOTE, can not only overcome the problem of data distortion but also improve overall system performance. Its mechanism aims to reduce the amount of data effectively without distortion by keeping the datasets balanced.

The processing formula is as follows:

$$\text{Information entropy: } H(x) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i),$$

$$\mathrm{SplitInformation}(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|},$$

$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}.$$

Here n is the number of classes, $p(x_i)$ is the proportion of samples belonging to class $x_i$, A is the feature used to divide data set S into c subsets $S_1, \ldots, S_c$, and $|S_i|/|S|$ is the proportion of the samples of S that fall into subset $S_i$.
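
The three quantities above can be computed directly. The Python sketch below is illustrative only: the text does not spell out Gain(S, A), so the standard information gain (the entropy of S minus the weighted entropy of the subsets produced by A) is assumed, and the toy feature values and labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Information entropy H of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(feature_values, labels):
    """Gain ratio of splitting the labels by a categorical feature."""
    n = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    # Assumed definition: Gain(S, A) = H(S) - sum_i |S_i|/|S| * H(S_i)
    conditional = sum(len(s) / n * entropy(s) for s in subsets.values())
    gain = entropy(labels) - conditional
    split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in subsets.values())
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: does a feature separate the two outcomes?
labels = ["yes", "yes", "no", "no"]
informative = ["a", "a", "b", "b"]    # splits the classes perfectly
uninformative = ["a", "b", "a", "b"]  # each subset is still a 50/50 mix

print(gain_ratio(informative, labels))    # 1.0
print(gain_ratio(uninformative, labels))  # 0.0
```

At each node, C4.5 would evaluate every candidate attribute this way and split on the one with the highest gain ratio.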

(2) Application Examples. Decision tree algorithms can construct specific decision trees for multiattribute datasets and produce feasible results in a relatively short time. They are a good method for data classification in medical text data mining.

Byeon [63] used the C4.5 algorithm to develop a depression prediction model for Korean dementia caregivers based on a secondary analysis of the results of the 2015 Korean Community Health Survey (KCHS). The effective prediction rate was 70%. The overall research idea is shown in Figure 4.

Figure 4

C4.5 algorithm application flow.

Wei et al. [64] selected reports from the Chinese spontaneous report database from 2010 to 2011 and used a decision tree to classify adverse drug reaction (ADR) signals. Zheng et al. [65] adopted a decision tree algorithm to construct a basic data framework. Three hundred records were randomly selected from the EHRs of 23281 diabetic patients to classify the type of diabetes. The framework performed well, with a classification accuracy as high as 98%.

However, decision tree algorithms have difficulty dealing with missing values in data, and medical text data contain many missing values owing to their high complexity. Therefore, when various types of data are inconsistent, decision tree algorithms will produce biased information, and correct results cannot be obtained.

3.3.4. Association Rules

Association rules are often sought in very large datasets, for which efficient algorithms are highly valued. They are used to discover correlations in large amounts of data and reflect the dependent or related knowledge between one event and other events [66]. Medical text data contain a large number of associations, such as those between symptoms and diseases and between drugs and diseases. Mining medical text data with an association rule algorithm helps to discover potential links in the data and promotes the development of medicine. Association rules are expressions of the form X → Y. Two key measures are defined on the transaction database:

Support(X → Y). The ratio of the number of transactions containing both X and Y to the number of all transactions

Confidence(X → Y). The ratio of the number of transactions containing both X and Y to the number of transactions containing X

Given a transaction data set, mining association rules means generating the rules whose support and confidence meet the minimum support and minimum confidence given by users, respectively.
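
As a small worked example of the two measures, the Python sketch below computes the support and confidence of a rule over a handful of transactions. The transactions (sets of findings and a diagnosis per patient record) and the function names are hypothetical and serve only to illustrate the ratios defined above.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of X and Y together divided by the support of X."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical transactions, one set of items per patient record.
transactions = [
    {"fever", "cough", "influenza"},
    {"fever", "cough"},
    {"fever", "influenza"},
    {"cough"},
]

# Rule: fever -> influenza
print(support(transactions, {"fever", "influenza"}))       # 2 of 4 records: 0.5
print(confidence(transactions, {"fever"}, {"influenza"}))  # 2 of the 3 fever records
```

With a minimum support of 0.5 and a minimum confidence of 0.6, this rule would be kept; raising the minimum confidence to 0.7 would discard it.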

(1) Core Algorithm: Apriori. The Apriori algorithm is the earliest and most classic association rule mining algorithm. It uses an iterative, level-wise search to find the relationships between items in the database. Each iteration consists of a join step (combining the frequent item sets of the previous level into candidates) and a pruning step (removing unnecessary intermediate results). In this algorithm, an item set is a set of items, and a set containing k items is called a k-item set. The frequency of an item set is the number of transactions that contain it. If an item set satisfies the minimum support, it is called a frequent item set.

The Apriori algorithm finds the frequent item sets in two steps:

Count the occurrence frequency of each single item, and keep the items whose support is not less than the minimum support to form the frequent 1-item sets

Loop, generating candidate (k+1)-item sets from the frequent k-item sets and keeping those that meet the minimum support, until no new frequent item set is generated
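
The two steps can be sketched compactly in Python. This is a minimal, unoptimized illustration of the level-wise search with its join and pruning steps, not a production implementation; the toy transactions and the minimum support of 0.5 are assumptions chosen for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return a dict mapping each frequent item set (frozenset) to its support."""
    n = len(transactions)

    def frequent(candidates):
        result = {}
        for c in candidates:
            s = sum(c <= t for t in transactions) / n
            if s >= min_support:
                result[c] = s
        return result

    # Step 1: frequent 1-item sets.
    current = frequent({frozenset([item]) for t in transactions for item in t})
    all_frequent = dict(current)

    # Step 2: loop until no new frequent item set is generated.
    k = 2
    while current:
        prev = set(current)
        # Join: combine frequent (k-1)-item sets into candidate k-item sets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        current = frequent(candidates)
        all_frequent.update(current)
        k += 1
    return all_frequent


# Hypothetical transactions, one set of items per patient record.
transactions = [
    {"fever", "cough", "influenza"},
    {"fever", "cough"},
    {"fever", "influenza"},
    {"cough"},
]
result = apriori(transactions, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

On this data, the candidate containing all three items is pruned without being counted, because its subset of cough and influenza is not frequent. This pruning is what saves database scans, although the repeated scans per level remain the efficiency limitation noted in Table 2.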

(2) Application Examples. Association rule mining is a data mining approach used to explore and interpret large transactional datasets in order to identify unique patterns and rules. It is often used to predict the correlation between index data and diseases. Exarchos et al. [67] proposed an automated method based on association rules, which used an association rule algorithm to classify and model electrocardiographic (ECG) data and to detect ischemic beats in long-duration ECG recordings. The specific application process of association rules in this study is shown in Figure 5.

Figure 5

Application process of association rules.

Hrovat et al. [68] combined association rule mining, which was designed for mining large transaction datasets, with model-based recursive partitioning to predict temporal trends (e.g., behavioral patterns) for subgroups of patients based on discharge summaries. In the correlation analysis between adverse drug reaction events and drug treatment, Chen et al. [69] used the apriori algorithm to explore the relationship between adverse events and drug treatment in patients with non-small-cell lung cancer, showing a promising method to reveal the risk factors of adverse events in the process of cancer treatment. In the association between drugs and diseases, Lu et al. [70] used the apriori algorithm to find herbal combinations for the treatment of uremic pruritus from Chinese herb bath therapy and explore the core drugs.

3.4. Model Evaluation

The classifications that a data mining model produces on a test set are not necessarily optimal, so the test set may be misclassified. To obtain a reliable model, it is very important to evaluate it. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are common evaluation methods in medical text data mining.

The ROC curve plots the TPR (sensitivity, also called recall) on the y-axis against the FPR (1 − specificity) on the x-axis. The higher the TPR and the lower the FPR, the better the model performs. AUC is defined as the area under the ROC curve, that is, the integral of the ROC curve, and its value lies between 0 and 1. Equivalently, if a positive sample and a negative sample are selected at random, the AUC is the probability that the classifier scores the positive sample higher than the negative one. Pourhoseing Holi et al. [71] used the AUC method to evaluate prognosis models for rectal cancer patients and found that the prediction accuracy of the random forest (RF) and Bayesian network (BN) models was high.
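
Both views of AUC, the area under the ROC curve and the probability that a random positive sample outranks a random negative one, can be checked against each other on a small example. The Python sketch below is illustrative; the labels, scores, and function names are hypothetical, and ties between a positive and a negative score are counted as one half in the ranking version, which is a common convention assumed here.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by lowering the decision threshold step by step."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


def auc_rank(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical ground truth (1 = positive) and classifier scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]

points = roc_points(labels, scores)
print(points)                  # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc_trapezoid(points))   # 0.75
print(auc_rank(labels, scores))  # 0.75
```

Three of the four positive and negative pairs are ranked correctly, which gives 0.75 under both definitions.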

4. Discussion

Data mining is useful for medical text data to extract novel and usable information or knowledge. This paper reviewed several research works which are done for mining medical text data based on four steps. It is beneficial for helping the researchers to choose reasonable approaches for mining medical text data. However, some difficulties in medical text data mining are also considered.

First, the lack of a publicly available annotation database affects the development of data mining to a certain extent, due to differences in medical information records and descriptions among countries. Its information components are highly heterogeneous and the data quality is not uniform. Ultimately, it brings about a key obstacle that makes annotation bottleneck existing in medical text data [72]. At present, the international standards include ICD (International Classification of Diseases), SNOMED CT (The Systematized Nomenclature of Human and Veterinary Medicine Clinical Terms), CPT (Current Procedural Terminology), DRG (Diagnosis-Related Groups), LOINC (Logical Observation Identifiers Names and Codes), Mesh (Medical Subject Headings), MDDB (Main Drug Database), and UMLS (Unified Medical Language System). There are few corpora in the field of medical text. In recent 10 years, natural language has undergone a truly revolutionary paradigm shift. More new technologies have been applied to the extraction of natural language information. Many scholars have established a corpus for a certain disease. However, there is a close relationship between medical entities. A single corpus cannot cut the data accurately, and it is easy to omit keyword information.

Second, text records of different countries have different opinions. For example, Ayurvedic medicine, traditional Arab Islamic medicine, and traditional Malay medicine from India, the Middle East, and Malaysia have problems such as inconsistent treatment description, complex treatment methods, and difficulty in statistical analysis, leading to great difficulty in medical data mining [73]. At the same time, the information construction of traditional medicine is insufficient. For example, the traditional North American indigenous medical literature mainly involves clinical efficacy evaluation and disease application, which is complicated in recording methods, leading to difficulty of data mining [74]. Chinese medical texts have the particularity of language. Unlike English expressions, Chinese words are not separated from each other, which increases the difficulty of data analysis. In terms of semantics, Chinese medical texts have problems such as existential polysemy, synonym, the ambiguity of expression, complex relationship, and lack of clear correlation. Building a standard database based on these data is very difficult, which requires very advanced and complex algorithms.

In addition, electronic medical records contain private personal information, and clinical electronic medical record data will inevitably be used in medical text data mining. Protecting patient privacy is therefore also an issue that requires attention in data mining.
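As a rough illustration of one privacy-protection step, the sketch below masks a few identifier patterns in a clinical note before mining. It is only a sketch under stated assumptions: the three regular expressions, the labels, and the sample note are hypothetical, and real de-identification must also handle names, addresses, and many other identifiers that simple patterns cannot capture.

```python
import re

# Hypothetical identifier patterns (assumptions for illustration only).
PATTERNS = [
    ("DATE", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),    # e.g. 2021-03-14
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),   # e.g. 555-867-5309
    ("MRN", re.compile(r"\bMRN[:\s]*\d+\b")),          # medical record number
]


def mask_identifiers(note):
    """Replace every match of each pattern with a bracketed placeholder."""
    for label, pattern in PATTERNS:
        note = pattern.sub(f"[{label}]", note)
    return note


note = "MRN: 483920 seen on 2021-03-14, callback 555-867-5309."
print(mask_identifiers(note))
# [MRN] seen on [DATE], callback [PHONE].
```

Masking with typed placeholders rather than deleting text keeps the sentence structure intact, so downstream text mining can still use the surrounding context.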

In future work, we will attempt to establish and popularize medical text data standards with the help of intelligent agents and construct publicly available annotation databases for the mining of medical text data.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (81703825), the Sichuan Science and Technology Program (2021YJ0254), and the Natural Science Foundation Project of the Education Department of Sichuan Province (18ZB01869).

Abbreviations

EMC: EMC Corporation (USA)
NLP: Natural language processing
ANN: Artificial neural network
BP: Backpropagation
ROC: Receiver operating characteristic
TPR: True positive rate
FPR: False positive rate
AUC: Area under the curve
RF: Random forest
BN: Bayesian network
EHR: Electronic health record
ICD: International Classification of Diseases
SNOMED CT: Systematized Nomenclature of Medicine Clinical Terms
CPT: Current Procedural Terminology
DRG: Diagnosis-related groups
MeSH: Medical Subject Headings
LOINC: Logical Observation Identifiers Names and Codes
UMLS: Unified Medical Language System
MDDB: Main drug database
SVM: Support vector machine
RNN: Recurrent neural network
ID3: Iterative Dichotomiser 3
KCHS: Korean community health survey
ADR: Adverse drug reactions
ECG: Electrocardiogram
FNN: Factorization machine-supported neural network.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

1. Oskouei R. J., Kor N. M., Maleki S. A. Data mining and medical world: breast cancers’ diagnosis, treatment, prognosis and challenges. American Journal of Cancer Research. 2017; 7 (3):610–627.

2. Zhang Y., Guo S.-L., Han L.-N., Li T.-L. Application and exploration of big data mining in clinical medicine. Chinese Medical Journal. 2016; 129 (6):731–738. doi: 10.4103/0366-6999.178019.

3. Polnaszek B., Gilmore-Bykovskyi A., Hovanes M., et al. Overcoming the challenges of unstructured data in multisite, electronic medical record-based abstraction. Medical Care. 2016; 54 (10):e65–e72. doi: 10.1097/mlr.0000000000000108.

4. Ford E., Oswald M., Hassan L., Bozentko K., Nenadic G., Cassell J. Should free-text data in electronic medical records be shared for research? A citizens’ jury study in the UK. Journal of Medical Ethics. 2020; 46 (6):367–377. doi: 10.1136/medethics-2019-105472.

5. Ayyoubzadeh S. M., Ayyoubzadeh S. M., Zahedi H., Ahmadi M., R Niakan Kalhori S. Predicting COVID-19 incidence through analysis of Google Trends data in Iran: data mining and deep learning pilot study. JMIR Public Health and Surveillance. 2020; 6 (2):e18828. doi: 10.2196/18828.

6. Ren X., Shao X. X., Li X. X., et al. Identifying potential treatments of COVID-19 from Traditional Chinese Medicine (TCM) by using a data-driven approach. Journal of Ethnopharmacology. 2020; 258 (1):112932. doi: 10.1016/j.jep.2020.112932.

7. Massaad E., Cherfan P. Social media data analytics on telehealth during the COVID-19 pandemic. Cureus. 2020; 12 (4):e7838.

8. Dong J., Wu H., Zhou D., et al. Application of big data and artificial intelligence in COVID-19 prevention, diagnosis, treatment and management decisions in China. Journal of Medical Systems. 2021; 45 (9):p. 84. doi: 10.1007/s10916-021-01757-0.

9. Moreira L. B., Namen A. A. A hybrid data mining model for diagnosis of patients with clinical suspicion of dementia. Computer Methods and Programs in Biomedicine. 2018; 165 (1):39–49. doi: 10.1016/j.cmpb.2018.08.016.

10. Vilar S., Friedman C., Hripcsak G. Detection of drug-drug interactions through data mining studies using clinical sources, scientific literature and social media. Briefings in Bioinformatics. 2018; 19 (5):863–877. doi: 10.1093/bib/bbx010.

11. Cha H. S., Yoon T. S., Ryu K. C., et al. Implementation of hospital examination reservation system using data mining technique. Healthcare Informatics Research. 2015; 21 (2):95–101. doi: 10.4258/hir.2015.21.2.95.

12. Gudenas B. L., Wang J., Kuang S.-z., Wei A.-q., Cogill S. B., Wang L.-j. Genomic data mining for functional annotation of human long noncoding RNAs. Journal of Zhejiang University - Science B. 2019; 20 (6):476–487. doi: 10.1631/jzus.b1900162.

13. Evans R. S. Electronic health records: then, now, and in the future. Yearbook of Medical Informatics. 2016; Suppl 1 (Suppl 1):S48–S61. doi: 10.15265/iys-2016-s006.

14. Austin P. C., White I. R., Lee D. S., van Buuren S. Missing data in clinical research: a tutorial on multiple imputation. Canadian Journal of Cardiology. 2021; 37 (9):1322–1331. doi: 10.1016/j.cjca.2020.11.010.

15. Yu L., Liu L., Peace K. E. Regression multiple imputation for missing data analysis. Statistical Methods in Medical Research. 2020; 29 (9):2647–2664. doi: 10.1177/0962280220908613.

16. Chang P. C., Wang C. L., Hsiao F. C., et al. Sacubitril/valsartan vs. angiotensin receptor inhibition in heart failure: a real‐world study in Taiwan. ESC Heart Failure. 2020; 7 (5):3003–3012. doi: 10.1002/ehf2.12924.

17. Tavazzi E., Daberdaku S., Vasta R., Andrea C., Adriano C., Barbara D. C. Exploiting mutual information for the imputation of static and dynamic mixed-type clinical data with an adaptive k-nearest neighbours approach. BMC Medical Informatics and Decision Making. 2020; 20 (Suppl 5):p. 174. doi: 10.1186/s12911-020-01166-2.

18. Idri A., Kadi I., Abnane I., Fernandez-Aleman J. L. Missing data techniques in classification for cardiovascular dysautonomias diagnosis. Medical & Biological Engineering & Computing. 2020; 58 (11):2863–2878. doi: 10.1007/s11517-020-02266-x.

19. Wang C., Yao C., Chen P., Jiamin S., Zhe G., Zheying Z. Artificial intelligence algorithm with ICD coding technology guided by the embedded electronic medical record system in medical record information management. Journal of Healthcare Engineering. 2021; 2021: Article ID 3293457, 9 pages. doi: 10.1155/2021/3293457.

20. Kreimeyer K., Foster M., Pandey A., et al. Natural language processing systems for capturing and standardizing unstructured clinical information: a systematic review. Journal of Biomedical Informatics. 2017; 73:14–29. doi: 10.1016/j.jbi.2017.07.012.

21. Kuo T. T., Rao P., Maehara C., et al. Ensembles of NLP tools for data element extraction from clinical notes. AMIA Annual Symposium Proceedings. 2017; 2016:1880–1889.

22. Jonnagaddala J., Liaw S.-T., Ray P., Kumar M., Chang N.-W., Dai H.-J. Coronary artery disease risk assessment from unstructured electronic health records using text mining. Journal of Biomedical Informatics. 2015; 58 (Suppl):S203–S210. doi: 10.1016/j.jbi.2015.08.003.

23. Trivedi G., Dadashzadeh E. R., Handzel R. M., Wendy W. C., Shyam V., Harry H. Interactive NLP in clinical care: identifying incidental findings in radiology reports. Applied Clinical Informatics. 2019; 10 (4):655–669. doi: 10.1055/s-0039-1695791.

24. Datta S., Bernstam E. V., Roberts K. A frame semantic overview of NLP-based information extraction for cancer-related EHR notes. Journal of Biomedical Informatics. 2019; 100 (1):103301. doi: 10.1016/j.jbi.2019.103301.

25. Roberts K., Demner-Fushman D. Annotating logical forms for EHR questions. LREC: International Conference on Language Resources and Evaluation [proceedings]. 2016; 2016 (3):772–778.

26. Vashishth S., Newman-Griffis D., Joshi R., Ritam D., Carolyn P. R. Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets. Journal of Biomedical Informatics. 2021; 121 (10):103880. doi: 10.1016/j.jbi.2021.103880.

27. Topaz M., Murga L., Bar-Bachar O., McDonald M., Bowles K. NimbleMiner. CIN: Computers, Informatics, Nursing. 2019; 37 (11):583–590. doi: 10.1097/cin.0000000000000557.

28. Maslove D. M., Podchiyska T., Lowe H. J. Discretization of continuous features in clinical datasets. Journal of the American Medical Informatics Association. 2013; 20 (3):544–553. doi: 10.1136/amiajnl-2012-000929.

29. Yildirim P., Majnarić L., Ekmekci O., Andreas H. Knowledge discovery of drug data on the example of adverse reaction prediction. BMC Bioinformatics. 2014; 15 (Suppl 6):p. S7. doi: 10.1186/1471-2105-15-s6-s7.

30. Ayatollahi H., Gholamhosseini L., Salehi M. Predicting coronary artery disease: a comparison between two data mining algorithms. BMC Public Health. 2019; 19 (1):p. 448. doi: 10.1186/s12889-019-6721-5.

31. Reiser M., Wiebner B., Wiebner B., Hirsch J. Neural-network analysis of socio-medical data to identify predictors of undiagnosed hepatitis C virus infections in Germany (DETECT). Journal of Translational Medicine. 2019; 17 (1):p. 94. doi: 10.1186/s12967-019-1832-4.

32. Rahman M. A., Honan B., Glanville T., Hough P., Walker K. Using data mining to predict emergency department length of stay greater than 4 hours: d. Emergency Medicine Australasia. 2020; 32 (3):416–421. doi: 10.1111/1742-6723.13421.

33. Lee J.-A., Kim K.-H., Kong D.-S., Lee S., Park S.-K., Park K. Algorithm to predict the outcome of mdh spasm: a data-mining analysis using a decision tree. World Neurosurgery. 2019; 125 (5):e797–e806. doi: 10.1016/j.wneu.2019.01.172.

34. Awaysheh A., Wilcke J., Elvinger F., Rees L., Fan W., Zimmerman K. L. Review of medical decision support and machine-learning methods. Veterinary Pathology. 2019; 56 (4):512–525. doi: 10.1177/0300985819829524.

35. You X., Xu Y., Huang J., et al. A data mining-based analysis of medication rules in treating bone marrow suppression by kidney-tonifying method. Evidence-Based Complementary and Alternative Medicine: eCAM. 2019;(1):p. 907848.

36. Atashi A., Tohidinezhad F., Dorri S., et al. Discovery of hidden patterns in breast cancer patients, using data mining on a real data set. Studies in Health Technology and Informatics. 2019; 262 (1):42–45.

37. Luo Z., Zhang G. Q., Xu R. Mining patterns of adverse events using aggregated clinical trial results. AMIA Joint Summits on Translational Science Proceedings. 2013; 2013 (1):12–16.

38. Li X., Liu G., Chen W., Bi Z., Liang H. Network analysis of autistic disease comorbidities in Chinese children based on ICD-10 codes. BMC Medical Informatics and Decision Making. 2020; 20 (1):p. 268. doi: 10.1186/s12911-020-01282-z.

39. Liu M. M., Wen L., Liu Y. J., Qiao C., Li T. L., Yong M. C. Application of data mining methods to improve screening for the risk of early gastric cancer. BMC Medical Informatics and Decision Making. 2018; 18 (Suppl 5):p. 121. doi: 10.1186/s12911-018-0689-4.

40. Wu Y.-c., Feng J.-w. Development and application of artificial neural network. Wireless Personal Communications. 2018; 102 (2):1645–1656. doi: 10.1007/s11277-017-5224-x.

41. Ramesh A., Kambhampati C., Monson J., Drew P. Artificial intelligence in medicine. Annals of the Royal College of Surgeons of England. 2004; 86 (5):334–338. doi: 10.1308/147870804290.

42. Liang Y., Li Q., Chen P., Xu L., Li J. Comparative study of back propagation artificial neural networks and logistic regression model in predicting poor prognosis after acute ischemic stroke. Open Medicine. 2019; 14 (1):324–330. doi: 10.1515/med-2019-0030.

43. Park S. Y., Kim S. M. Acute appendicitis diagnosis using artificial neural networks. Technology and Health Care: Official Journal of the European Society for Engineering and Medicine. 2015; 23 (Suppl 2):S559–S565. doi: 10.3233/thc-150994.

44. Kuo R.-J., Huang M.-H., Cheng W.-C., Lin C.-C., Wu Y.-H. Application of a two-stage fuzzy neural network to a prostate cancer prognosis system. Artificial Intelligence in Medicine. 2015; 63 (2):119–133. doi: 10.1016/j.artmed.2014.12.008.

45. Liu L., Zhao T., Ma M., Wang Y. A new gene regulatory network model based on BP algorithm for interrogating differentially expressed genes of Sea Urchin. SpringerPlus. 2016; 5 (1):p. 1911. doi: 10.1186/s40064-016-3526-1.

46. Cleophas T. J., Cleophas T. F. Artificial intelligence for diagnostic purposes: principles, procedures and limitations. Clinical Chemistry and Laboratory Medicine. 2010; 48 (2):159–165. doi: 10.1515/cclm.2010.045.

47. Heckerling P., Canaris G., Flach S., Tape T., Wigton R., Gerber B. Predictors of urinary tract infection based on artificial neural networks and genetic algorithms. International Journal of Medical Informatics. 2007; 76 (4):289–296. doi: 10.1016/j.ijmedinf.2006.01.005.

48. Miotto R., Li L., Kidd B. A. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Scientific Reports. 2016; 6:p. 26094. doi: 10.1038/srep26094.

49. Armstrong A. J., Marengo M. S., Oltean S., et al. Circulating t cells from patients with advanced prostate and breast cancer display both epithelial and mm. Molecular Cancer Research. 2011; 9 (8):997–1007. doi: 10.1158/1541-7786.mcr-10-0490.

50. Lindley D. V. Fiducial distributions and Bayes’ theorem. Journal of the Royal Statistical Society: Series B. 1958; 20 (1):102–107. doi: 10.1111/j.2517-6161.1958.tb00278.x.

51. Uddin S., Khan A., Hossain M. E., Moni M. A. Comparing different supervised machine learning algorithms for disease prediction. BMC Medical Informatics and Decision Making. 2019; 19 (1):p. 281. doi: 10.1186/s12911-019-1004-8.

52. Rashidi H. H., Tran N. K., Betts E. V., Lydia P. H., Ralph G. Artificial intelligence and machine learning in pathology: the present landscape of supervised methods. Academic Pathology. 2019; 6. doi: 10.1177/2374289519873088.

53. Yildirim P., Birant D. Naive Bayes classifier for continuous variables using novel method (NBC4D) and distributions. Proceedings of the 2014 IEEE International Symposium on Innovations in Intelligent Systems and Applications (INISTA); June 2014; Alberobello, Italy. IEEE; pp. 110–115.

54. Ehsani-Moghaddam B., Queenan J. A., Mackenzie J., Birtwhistle R. V. Mucopolysaccharidosis type II detection by Naïve Bayes Classifier: an example of patient classification for a rare disease using electronic medical records from the Canadian Primary Care Sentinel Surveillance Network. PLoS One. 2018; 13 (12):e0209018. doi: 10.1371/journal.pone.0209018.

55. Golpour P., Ghayour-Mobarhan M., Saki A., et al. Comparison of support vector machine, naïve Bayes and logistic regression for assessing the necessity for coronary angiography. International Journal of Environmental Research and Public Health. 2020; 17 (18). doi: 10.3390/ijerph17186449.

56. Che D., Liu Q., Rasheed K., Tao X., Xiuping T. Decision tree and ensemble learning algorithms with their applications in bioinformatics. Advances in Experimental Medicine & Biology. 2011:191–199. doi: 10.1007/978-1-4419-7046-6_19.

57. Moraes L. O., Pedreira C. E., Barrena S., Lopez A., Orfao A. A decision-tree approach for the differential diagnosis of chronic lymphoid leukemias and peripheral B-cell lymphomas. Computer Methods and Programs in Biomedicine. 2019; 178:85–90. doi: 10.1016/j.cmpb.2019.06.014.

58. Oh W., Steinbach M. S., Castro M. R., et al. Evaluating the impact of data representation on EHR-based analytic tasks. Studies in Health Technology and Informatics. 2019; 264 (2):88–92.

59. Nakas C. T., Schütz N., Werners M., Leichtle A. B. Accuracy and calibration of computational approaches for inpatient mortality predictive modeling. PLoS One. 2016; 11 (7):e0159046. doi: 10.1371/journal.pone.0159046.

60. Quinlan J. R. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann Publishers; 1993.

61. Quinlan J. R. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann Publishers; 1993.

62. Franzese G., Visintin M. Probabilistic ensemble of deep information networks. Entropy. 2020; 22 (1):p. 100. doi: 10.3390/e22010100.

63. Byeon H. Development of depression prediction models for caregivers of patients with dementia using decision tree learning algorithm. International Journal of Gerontology. 2019; 13 (4):314–319.

64. Wei J.-X., Wang J., Zhu Y.-X., Sun J., Xu H.-M., Li M. Traditional Chinese medicine pharmacovigilance in signal detection: decision tree-based data classification. BMC Medical Informatics and Decision Making. 2018; 18 (1):p. 19. doi: 10.1186/s12911-018-0599-5.

65. Zheng T., Xie W., Xu L., et al. A machine learning-based framework to identify type 2 diabetes through electronic health records. International Journal of Medical Informatics. 2017; 97:120–127. doi: 10.1016/j.ijmedinf.2016.09.014.

66. Veroneze R., Cruz T., Corbi S., et al. Using association rule mining to jointly detect clinical features and differentially expressed genes related to chronic inflammatory diseases. PLoS One. 2020; 15 (10):e0240269. doi: 10.1371/journal.pone.0240269.

67. Exarchos T. P., Papaloukas C., Fotiadis D. I., Michalis L. K. An association rule mining-based methodology for automated detection of ischemic ECG beats. IEEE Transactions on Biomedical Engineering. 2006; 53 (8):1531–1540. doi: 10.1109/tbme.2006.873753.

68. Hrovat G., Stiglic G., Kokol P., Ojsteršek M. Contrasting temporal trend discovery for large healthcare databases. Computer Methods and Programs in Biomedicine. 2014; 113 (1):251–257. doi: 10.1016/j.cmpb.2013.09.005.

69. Chen W., Yang J., Wang H. L., Shi Y. F., Tang H., Li G. H. Discovering associations of adverse events with pharmacotherapy in patients with non-small cell lung cancer using modified Apriori algorithm. BioMed Research International. 2018; 2018 (12): Article ID 1245616, 10 pages. doi: 10.1155/2018/1245616.

70. Lu P. H., Keng J. L., Kuo K. L., Yu F. W., Yu C. T., Chan Y. K. An Apriori algorithm-based association rule analysis to identify herb combinations for treating uremic pruritus using Chinese herbal bath therapy. Evidence-Based Complementary and Alternative Medicine: eCAM. 2020; 2020 (8): Article ID 8854772, 9 pages. doi: 10.1155/2020/8854772.

71. Mlakar M., Puddu P. E., Somrak M., Bonfiglio S., Luštrek M. Mining telemonitored physiological data and patient-reported outcomes of congestive heart failure patients. PLoS One. 2018; 13 (3):e0190323. doi: 10.1371/journal.pone.0190323.

72. Spasic I., Nenadic G. Clinical text data in machine learning: systematic review. JMIR Medical Informatics. 2020; 8 (3):e17984. doi: 10.2196/17984.

73. Ikram R. R. R., Ghani M. K. A., Abdullah N. An analysis of application of health informatics in Traditional Medicine: a review of four Traditional Medicine Systems. International Journal of Medical Informatics. 2015; 84 (11):988–996. doi: 10.1016/j.ijmedinf.2015.05.007.

74. Redvers N., Blondin B. Traditional Indigenous medicine in North America: a scoping review. PLoS One. 2020; 15 (8):e0237531. doi: 10.1371/journal.pone.0237531.