Data Processing, Organisation, Cleaning and Validation - Financial Management and Business Data Analytics | CMA Inter Syllabus
Data processing (DP) is the process of organising, categorising, and manipulating data in order to extract information. Information in this context refers to valuable connections and trends that may be used to address pressing issues. In recent years, the capacity and effectiveness of DP have increased manifold with the development of technology.
Data processing that once required a great deal of human labour has progressively been superseded by modern tools and technology. The techniques and procedures used in DP, and the algorithms for extracting information from data, have developed considerably in recent years; for instance, classification is necessary for facial recognition, and time series analysis is necessary for processing stock market data.
The information extracted as a result of DP is also heavily reliant on the quality of the data. Data quality may be affected by several issues such as missing data and duplications. There may be other fundamental problems, such as incorrect equipment design and biased data collection, which are more difficult to address. The history of DP can be divided into three phases as a result of technological advancements:
(i) Manual DP: Manual DP involves processing data without much assistance from machines. Prior to the phase of mechanical DP, only small-scale data processing was possible using manual effort. However, in some special cases manual DP is still in use today, typically because the data is difficult to digitise or cannot be read by machines, as in the case of retrieving data from outdated texts or documents.
(ii) Mechanical DP: Mechanical DP processes data using mechanical (not modern computers) tools and technologies. This phase began in 1890 (Bohme et al., 1991) when a system made up of intricate punch card machines was installed by the US Bureau of the Census in order to assist in compiling the findings of a recent national population census. Use of mechanical DP made it quicker and easier to search and compute the data than manual process.
(iii) Electronic DP: Finally, electronic DP replaced the other two, resulting in fewer mistakes and rising productivity. Data processing is now done electronically using computers and other cutting-edge electronic devices, and it is widely used in industry, research institutions and academia.
How are data processing and data science relevant to finance?
The relevance of data processing and data science in the area of finance is increasing every day. The eleven significant areas where data science plays an important role are:
(i) Risk analytics
Business inevitably involves risk, particularly in the financial industry. It is crucial to determine the risk factor before making any decisions. Risk analytics, driven by data science, offers a better method for defending the business against potential threats, including cybersecurity risks. Given that a large portion of a company’s risk-related data is “unstructured,” analysing it without data science methods can be challenging and prone to human error.
The severity of a loss and the regularity of its recurrence can help highlight the precise areas that pose the maximum threat, allowing similar circumstances to be avoided in the future. Once a danger has been recognised, it may be prioritised and its recurrence closely monitored.
Machine learning algorithms can look through historical transactions and general information to help banks analyse each customer’s reliability and trustworthiness and determine the relative risk of accepting or lending to them.
Similar to this, transaction data may be used to create a dynamic, real-time risk assessment model that responds immediately to any new transactions or modifications to client data.
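The kind of reliability scoring described above can be sketched with a simple model. The following is a minimal, illustrative example only: the borrower features, the toy data and the use of scikit-learn's LogisticRegression are assumptions, not part of the syllabus text.

```python
# Minimal sketch (assumed data and features): scoring borrower risk from
# historical records with a logistic regression classifier.
import pandas as pd
from sklearn.linear_model import LogisticRegression

history = pd.DataFrame({
    "income":        [35000, 82000, 41000, 120000, 28000, 67000],
    "utilisation":   [0.92, 0.35, 0.80, 0.10, 0.95, 0.40],  # credit utilisation ratio
    "late_payments": [4, 0, 3, 0, 5, 1],
    "defaulted":     [1, 0, 1, 0, 1, 0],                    # 1 = defaulted in the past
})

X = history[["income", "utilisation", "late_payments"]]
y = history["defaulted"]
model = LogisticRegression().fit(X, y)

# Probability of default for a new applicant, usable as a relative risk score
applicant = pd.DataFrame({"income": [45000], "utilisation": [0.75], "late_payments": [2]})
print(model.predict_proba(applicant)[0][1])
```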
(ii) Real time analytics
Prior to significant advances in Data Engineering (Airflow, Spark, and Cloud solutions), all data was historical in nature. Data engineers would discover significance in numbers that were days, weeks, months, or even years old since that was the only accessible information.
It was processed in batches, which meant that no analysis could be performed until a batch of data had been gathered within a predetermined timescale. Consequently, any conclusions drawn from this data were possibly invalid.
With technological advancement and improved hardware, real-time analytics are now available, as Data Engineering, Data Science, Machine Learning, and Business Intelligence work together to provide the optimal user experience. Thanks to dynamic data pipelines, data streams, and a speedier data transmission between source and analyzer, businesses can now respond quickly to consumer interactions. With real-time analysis, there are no delays in establishing a customer’s worth to an organisation, and credit ratings and transactions are far more precise.
(iii) Customer data management
Data science enables effective management of client data. In recent years, many financial institutions may have processed their data solely through the machine learning capabilities of Business Intelligence (BI). However, the proliferation of big data and unstructured data has rendered this method significantly less effective for predicting risk and future trends.
There are currently more transactions occurring every minute than ever before, thus there is better data accessibility for analysis. Due to the arrival of social media and new Internet of Things (IoT) devices, a significant portion of this data does not conform to the structure of organised data previously employed.
Using methods such as text analytics, data mining, and natural language processing, data science is well equipped to deal with massive volumes of unstructured new data. Consequently, despite the fact that data availability has been enhanced, data science implies that a company’s analytical capabilities may also be upgraded, leading to a greater understanding of market patterns and client behaviour.
(iv) Consumer Analytics
In a world where choice has never been more crucial, it has become evident that each customer is unique; nonetheless, there have never been more consumers. This contradiction cannot be sustained without the intelligence and automation of machine learning.
It is as important to ensure that each client receives a customised service as it is to process their data swiftly and efficiently, without time-intensive individualised analysis.
As a consequence, insurance firms are using real-time analytics in conjunction with prior data patterns and quick analysis of each customer’s transaction history to eliminate sub-zero consumers, enhance cross-sales, and calculate a consumer’s lifetime worth. This allows each financial institution to keep its own degree of security while still reviewing each application individually.
(v) Customer segmentation
Despite the fact that each consumer is unique, it is only possible to comprehend their behaviour after they have been categorised or divided. Customers are frequently segmented based on socioeconomic factors, such as geography, age, and buying patterns.
By examining these clusters collectively, organisations in the financial industry and beyond may assess a customer’s current and long-term worth. With this information, organisations may eliminate clients who provide little value and focus on those with promise.
To do this, data scientists can use automated machine learning algorithms to categorise their clients based on specified attributes that have been assigned relative relevance scores. Comparing these groupings to former customers reveals the expected value of time invested with each client.
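As an illustrative sketch of the automated grouping described above, the snippet below clusters a toy set of customers with k-means; the chosen features, the random data and the value of k are assumptions made purely for demonstration.

```python
# Minimal sketch (assumed features): segmenting customers by age, income
# and spending score using k-means clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
customers = np.column_stack([
    rng.integers(18, 70, 200),    # age
    rng.integers(20, 150, 200),   # annual income ('000)
    rng.integers(1, 100, 200),    # spending score
])

segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(customers)
print(np.bincount(segments))      # number of customers in each segment
```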
(vi) Personalized services
The requirement to customise each customer’s experience extends beyond gauging risk assessment. Even major organisations strive to provide customised service to their consumers as a method of enhancing their reputation and increasing customer lifetime value. This is also true for businesses in the finance sector.
From customer evaluations to telephone interactions, everything can be studied in a way that benefits both the business and the consumer. By delivering the consumer a product that precisely meets their needs, cross-selling may be facilitated by a thorough comprehension of these interactions.
Natural language processing (NLP) and voice recognition technologies dissect these encounters into a series of important points that can identify chances to increase revenue, enhance the customer service experience, and steer the company’s future. Due to the rapid progress of NLP research, the potential is yet to be fully realised.
(vii) Advanced customer service
Data science’s capacity to give superior customer service goes hand in hand with its ability to provide customised services. As client interactions may be evaluated in real time, more effective recommendations can be offered to the customer care agent managing the customer’s case throughout the conversation. Natural language processing can offer chances for practical financial advice based on what the consumer is saying, even if the customer is unsure of the product they are seeking. The customer support agent can then cross-sell or up-sell while efficiently addressing the client’s inquiry. The knowledge from each encounter may then be utilised to inform subsequent interactions of a similar nature, hence enhancing the system’s efficacy over time.
(viii) Predictive Analytics
Predictive analytics enables organisations in the financial sector to extrapolate from existing data and anticipate what may occur in the future, including how patterns may evolve. When prediction is necessary, machine learning is utilised. Using machine learning techniques, pre-processed data may be input into the system in order for it to learn how to anticipate future occurrences accurately.
More information improves the prediction model. Typically, for an algorithm to function in shallow learning, the data must be cleansed and altered. Deep learning, on the other hand, changes the data without the need for human preparation to establish the initial rules, and so achieves superior performance.
In the case of stock market pricing, machine learning algorithms learn trends from past data in a certain interval (may be a week, month, or quarter) and then forecast future stock market trends based on this historical information. This allows data scientists to depict expected patterns for end-users in order to assist them in making investment decisions and developing trading strategies.
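A minimal sketch of the idea above, assuming a toy weekly price series and a simple straight-line trend; real forecasting systems use far richer features and models.

```python
# Minimal sketch (toy data): learn a linear trend from past weekly closing
# prices and extrapolate it four periods ahead.
import numpy as np

prices = np.array([101.0, 102.5, 101.8, 103.2, 104.0, 105.1, 104.7, 106.3])
weeks = np.arange(len(prices))

slope, intercept = np.polyfit(weeks, prices, deg=1)    # fit the trend line
future_weeks = np.arange(len(prices), len(prices) + 4)
forecast = intercept + slope * future_weeks

print(forecast)  # naive projection of the next four weekly prices
```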
(ix) Fraud detection
With a rise in financial transactions, the risk of fraud also increases. Tracking incidents of fraud, such as identity theft and credit card scams, and limiting the resulting harm is a primary responsibility for financial institutions. As the technologies used to analyse big data become more sophisticated, so does their capacity to detect fraud early on.
Artificial intelligence and machine learning algorithms can now detect credit card fraud significantly more precisely, owing to the vast amount of data accessible from which to draw trends and the capacity to respond in real time to suspect behaviour.
If a major purchase is made on a credit card belonging to a consumer who has traditionally been very frugal, the card can be immediately terminated, and a notification sent to the card owner.
This protects not just the client, but also the bank and the client’s insurance carrier. When it comes to trading, machine learning techniques discover irregularities and notify the relevant financial institution, enabling speedy inquiry.
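The automated flagging described above can be sketched as follows; the transaction amounts and the choice of an Isolation Forest detector are assumptions for illustration, not a prescribed method.

```python
# Minimal sketch (assumed data): flagging an unusually large card purchase
# in a frugal customer's history with an Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
usual_spend = rng.normal(40, 10, size=(500, 1))        # typical small purchases
transactions = np.vstack([usual_spend, [[1200.0]]])    # one very large purchase

flags = IsolationForest(contamination=0.01, random_state=0).fit_predict(transactions)
print(np.where(flags == -1)[0])  # indices flagged as suspicious
```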
(x) Anomaly detection
Financial services have long placed a premium on detecting abnormalities in a customer’s bank account activities, partly because anomalies are only proved to be anomalous after the event happens. Although data science can provide real-time insights, it cannot anticipate singular incidents of credit card fraud or identity theft.
However, data analytics can discover instances of unlawful insider trading before they cause considerable harm. The methods for anomaly identification consist of Recurrent Neural Networks and Long Short-Term Memory models. These algorithms can analyse the behaviour of traders before and after information about the stock market becomes public in order to determine if they illegally monopolised stock market forecasts and took advantage of investors. Transformers, which are next-generation designs for a variety of applications, including Anomaly Detection, are the foundation of more modern solutions.
(xi) Algorithmic trading
Algorithmic trading is one of the key uses of data science in finance. Algorithmic trading happens when an unsupervised computer, using the intelligence supplied by an algorithm, executes trade suggestions on the stock market. As a consequence, it eliminates the risk of loss caused by indecision and human error.
The trading algorithm is developed according to a set of stringent rules that decide whether it will trade on a specific market at a specific moment (there is no restriction on which markets algorithmic trading can work on).
This method is known as Reinforcement Learning, in which the model is taught using penalties and rewards associated with the rules. Each time a transaction proves to be a poor option, a model of reinforcement learning ensures that the algorithm learns and adapts its rules accordingly.
One of the primary advantages of algorithmic trading is the increased frequency of deals. Based on facts and taught behaviour, the computer can operate in a fraction of a second without human indecision or thought. Similarly, the machine will only trade when it perceives a profit opportunity according to its rule set, regardless of how rare these chances may be.
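The reward-and-penalty learning described above can be illustrated with a toy tabular Q-learning loop; the two actions, the "up/down" states and the invented price moves below are assumptions for demonstration only and bear no resemblance to a production trading system.

```python
# Minimal sketch: a toy Q-learning agent that is rewarded for profitable
# "buy" decisions and penalised for unprofitable ones.
import random

random.seed(0)
actions = ["hold", "buy"]
states = ["up", "down"]                      # direction of the previous price move
q = {(s, a): 0.0 for s in states for a in actions}
alpha, gamma, epsilon = 0.1, 0.9, 0.2        # learning rate, discount, exploration

price_moves = [1, -1, 1, 1, -1, 1, -1, -1, 1, 1] * 50   # invented price changes
state = "up"
for move in price_moves:
    # epsilon-greedy choice between exploring and exploiting learned values
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: q[(state, a)])
    reward = move if action == "buy" else 0  # profit or loss only when we trade
    next_state = "up" if move > 0 else "down"
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    state = next_state

print(q)  # the agent learns a preference for buying after favourable moves
```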
The major functions involved in processing and organising data include validation, sorting, aggregation, analysis, reporting, and classification.

(i) Validation:-
As per the UNECE glossary on statistical data editing (UNECE 2013), data validation may be defined as ‘An activity aimed at verifying whether the value of a data item comes from the given (finite or infinite) set of acceptable values.’
Simon (2013) defined data validation as “Data validation could be operationally defined as a process which ensures the correspondence of the final (published) data with a number of quality characteristics.”
Data validation is a decision-making process that leads to the acceptance or rejection of data. Data is subjected to rules: data are deemed legitimate for the intended final use if they comply with the rules, i.e. the combinations stated by the rules are not violated.
The objective of data validation is to assure a particular degree of data quality.
In official statistics, however, quality has multiple dimensions: relevance, correctness, timeliness and punctuality, accessibility and clarity, comparability, coherence, and comprehensiveness. Therefore, it is essential to determine which components data validation addresses.
(ii) Sorting:-
Data sorting is any procedure that organises data into a meaningful order to make it simpler to comprehend, analyse, and visualise. Sorting is a typical strategy for presenting research data in a manner that facilitates comprehension of the story being told by the data. Sorting can be performed on raw data (across all records) or aggregated information (in a table, chart, or some other aggregated or summarised output). Summarisation (statistical or automatic) involves reducing detailed data to its main points.
Typically, data is sorted in ascending or descending order based on actual numbers, counts, or percentages, but it may also be sorted based on variable value labels. Value labels are metadata present in certain applications that let the researcher save labels for each value alternative in a categorical question. The vast majority of software programmes permit sorting by multiple factors. A data collection including region and country fields, for instance, can be sorted by region as the main sort and subsequently by country. Within each sorted region, the country sort will then be applied.
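The region-then-country sort described above can be sketched with pandas; the column names and rows are assumed for illustration.

```python
# Minimal sketch (assumed data): primary sort by region, secondary sort by country.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["Asia", "Europe", "Asia", "Europe", "Asia"],
    "country": ["Japan", "France", "India", "Germany", "China"],
    "revenue": [120, 95, 180, 110, 150],
})

# the country sort is applied within each region
print(sales.sort_values(by=["region", "country"]))
```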
When working with any type of data, there are a number of typical sorting applications. One such use is data cleaning, in which data is sorted in order to identify anomalies in a data pattern. For instance, monthly sales data can be sorted by month to identify sales volume variations.
Sorting is also frequently used to rank or prioritise records. In this instance, data is sorted based on a rank, computed score, or other weighing factor (for example, highest volume accounts or heavy usage customers).
It is also vitally necessary to organise visualisations (tables, charts, etc.) correctly to facilitate accurate data interpretation. In market research, for instance, it is typical to sort the findings of a single-response question by column percentage, i.e. from most to least frequently answered, as in a typical brand preference question.
Incorrect classification frequently results in misunderstanding. Always verify that the most logical sorts are applied to every visualisation.
Using sorting functions is an easy idea to comprehend, but there are a few technical considerations to keep in mind. The arbitrary sorting of non-unique data is one such issue. Consider, for example, a data collection comprising region and nation variables, as well as several records per area. If a region-based sort is implemented, what is the default secondary sort? In other words, how will the data be sorted inside each region?
This depends on the application in question. Excel, for instance, will preserve the original order as the default sort order following the execution of the primary sort. SQL databases do not have a default sort order; this depends on other factors, such as the database management system (DBMS) in use, indexes, and other variables. Other programmes may perform extra sorting by default based on the column order.
In nearly every level of data processing, the vast majority of analytical and statistical software programmes offer a variety of sorting options.
(iii) Aggregation:-
Data aggregation refers to any process in which data is collected and summarised. When data is aggregated, individual data rows, which are often compiled from several sources, are replaced with summaries or totals. Groups of observed aggregates are replaced with statistical summaries based on these observations. A data warehouse often contains aggregate data since it may offer answers to analytical inquiries and drastically cut the time required to query massive data sets.
A common application of data aggregation is to offer statistical analysis for groups of individuals and to provide relevant summary data for business analysis. Large-scale data aggregation is commonplace, utilising software tools known as data aggregators. Typically, data aggregators comprise functions for gathering, processing, and displaying aggregated data.
Data aggregation enables analysts to access and analyse vast quantities of data in a reasonable amount of time. A single row of aggregate data may represent hundreds, thousands, or even millions of individual data entries. As data is aggregated, it may be queried rapidly, as opposed to spending processing cycles acquiring each individual data row and aggregating it in real time whenever it is requested or accessed. As the amount of data kept by businesses continues to grow, aggregating the most significant and often requested data can facilitate efficient access.
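A short, illustrative sketch of aggregation with pandas follows; the transaction-level data and column names are assumptions.

```python
# Minimal sketch (assumed data): replacing individual transaction rows with
# per-branch summary statistics.
import pandas as pd

transactions = pd.DataFrame({
    "branch": ["Mumbai", "Delhi", "Mumbai", "Kolkata", "Delhi", "Mumbai"],
    "amount": [2500, 1800, 3200, 1500, 2100, 2750],
})

summary = transactions.groupby("branch")["amount"].agg(["count", "sum", "mean"])
print(summary)   # one aggregate row per branch instead of one row per transaction
```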
(iv) Analysis:-
Data analysis is described as the process of cleaning, converting, and modelling data to obtain actionable business intelligence. The objective of data analysis is to extract relevant information from data and make decisions based on this knowledge.
Every time we make a decision in our day-to-day lives, we consider what occurred previously or what would occur if we chose a specific option. This is a simple example of data analysis: studying the past or the future and basing judgements on that analysis, whether by recalling our history or by imagining our future. The same task, performed by an analyst for commercial purposes, is known as data analysis.
Analysis is sometimes all that is required to grow a business and its finances.
If a firm is not expanding, it must acknowledge past errors and create a new plan to avoid making the same mistakes. And even if the firm is expanding, it must plan for further expansion. All that is required is an analysis of the business data and operations.
(v) Reporting:-
Data reporting is the act of gathering and structuring raw data and turning it into a consumable format in order to evaluate the organisation’s continuous performance.
The data reports can provide answers to fundamental inquiries regarding the status of the firm. They can display the status of certain data within an Excel document or a simple data visualisation tool. Static data reports often employ the same structure throughout time and collect data from a single source.
A data report is nothing more than a set of documented facts and numbers. Consider the population count as an illustration. This is a technical paper conveying basic facts on the population and demographics of a country. It may be presented in text or in a graphical manner, such as a graph or chart. However, static information may be utilised to evaluate present situations.
Financial data such as revenues, accounts receivable, and net profits are often summarised in a company’s data reporting. This gives an up-to-date record of the company’s financial health or a portion of the finances, such as sales. A sales director may report on KPIs based on location, funnel stage, and closing rate in order to present an accurate view of the whole sales pipeline.
Data provides a method for measuring development in many aspects of our lives. It influences both our professional judgements and our day-to-day affairs. A data report would indicate where we should devote the most time and money, as well as what needs more organisation or attention.
In any industry, accurate data reporting plays a crucial role. Utilizing business information in healthcare enables physicians to provide more effective and efficient patient care, hence saving lives. In education, data reports may be utilised to study the relationship between attendance records and seasonal weather patterns, as well as the intersection of acceptance rates and neighbourhood regions.
The most effective business analysts possess specific competencies. An outstanding business analyst must be able to prioritise the most pertinent data. There is no space for error in data reporting, which necessitates high thoroughness and attention to detail. The capacity to comprehend and organise enormous volumes of information is another valuable talent. Lastly, the ability to organise and present data in an easy-to-read fashion is essential for all data reporters.
Excellence in data reporting does not necessitate immersion in coding or proficiency in analytics. Other necessary talents include the ability to extract vital information from data, to keep things simple, and to prevent data hoarding.
Although static reporting can be precise and helpful, it has limitations. One such limitation is the absence of real-time insights. When confronted with a vast volume of data to organise into a usable and actionable format, a report enables senior management or the sales team to provide guidance on future steps. However, if the layout, data, and formulae are not delivered in a timely way, they may be out of date.
The reporting of data is vital to an organisation’s business intelligence. The greater an organisation’s access to data, the more agile it can be. This can help a firm maintain its relevance in a market that is becoming increasingly competitive and dynamic. An efficient data reporting system will facilitate judicious judgements that might steer a business in new directions and provide additional income streams.
(vi) Classification:-
Data classification is the process of classifying data according to important categories so that it may be utilised and safeguarded more effectively. The categorization process makes data easier to identify and access on a fundamental level. Regarding risk management, compliance, and data security, the classification of data is of special relevance.
Classifying data entails labelling it to make it searchable and trackable. Additionally, it avoids many duplications of data, which can minimise storage and backup expenses and accelerate the search procedure. The categorization process may sound very technical, yet it is a topic that your organisation’s leadership must comprehend.
The categorization of data has vastly improved over time. Today, the technology is employed for a number of applications, most frequently to assist data security activities. However, data may be categorised for a variety of purposes, including facilitating access, ensuring regulatory compliance, and achieving other commercial or personal goals. In many instances, data classification is a statutory obligation, since data must be searchable and retrievable within predetermined deadlines. For the purposes of data security, data classification is a useful strategy that simplifies the application of appropriate security measures based on the kind of data being accessed, sent, or duplicated.
Classification of data frequently entails an abundance of tags and labels that identify the data’s kind, secrecy, and integrity. In data classification procedures, availability may also be taken into account. It is common practice to classify the sensitivity of data based on varying levels of relevance or secrecy, which correspond to the security measures required to safeguard each classification level.
Three primary methods of data classification are recognised as industry standards:
● Content-based classification examines and interprets files for sensitive data; a brief illustrative sketch follows this list.
● Context-based classification considers, among other characteristics, application, location, and creator as indirect markers of sensitive information.
● User-based classification relies on the human selection of each document by the end user. To indicate sensitive documents, user-based classification depends on human expertise and judgement during document creation, editing, review, or distribution.
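As mentioned in the list above, content-based classification can be sketched very simply; the patterns and the two labels below are illustrative assumptions, and real tools apply far richer rule sets.

```python
# Minimal sketch: label a document "Confidential" if it appears to contain
# an e-mail address or a PAN-style identifier, otherwise "Public".
import re

SENSITIVE_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),       # e-mail address
    re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),     # PAN-style identifier
]

def classify(text: str) -> str:
    if any(p.search(text) for p in SENSITIVE_PATTERNS):
        return "Confidential"
    return "Public"

print(classify("Quarterly newsletter for all staff"))            # Public
print(classify("Customer PAN ABCDE1234F, mail a@example.com"))   # Confidential
```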
In addition to the classification types, it is prudent for an organisation to identify the relative risk associated with the data types, how the data is handled, and where it is stored/sent (endpoints). It is standard practice to divide data and systems into three risk categories.
Low risk: If data is accessible to the public and recovery is simple, then this data collection and the mechanisms around it pose a smaller risk than others.
Moderate risk: Essentially, this is non-public or internal (to a business or its partners) data. However, it is unlikely to be mission-critical or sensitive enough to be considered “high risk.” The intermediate category may include proprietary operating procedures, cost of goods, and certain corporate paperwork.
High risk: Anything even vaguely sensitive or critical to operational security falls under the category of high risk, as does data that would be incredibly difficult to recover if lost. All secret, sensitive, and essential data falls under the category of high risk.
Data creation and labelling may be simple for certain companies. If there are not a significant number of data kinds or if your firm has fewer transactions, it will likely be easier to determine the risk of your data and systems. However, many businesses working with large volumes or numerous types of data will certainly require a thorough method for assessing their risk. Many utilise a “data categorization matrix” for this purpose.
Creating a matrix that rates data and/or systems based on how likely they are to be hacked and how sensitive the data is enables you to rapidly identify how to classify and safeguard all sensitive information.
| Risk | Confidential Data | Sensitive Data | Public |
| --- | --- | --- | --- |
| Level | High | Medium | Low |
| Institution Impact | The negative impact on the institution, should this data be incorrect, improperly disclosed, or not available when needed, is typically very high. | The risk of negative impact on the institution, should this information not be available when needed, is typically moderate. | The impact on the institution, should Public data not be available, is typically low (inconvenient but not debilitating). |
| Description | Access to Confidential institutional data must be controlled from creation to | Access to Sensitive institutional data must be requested from, and authorised by, the Functional Security Module Representative who | Access to Public institutional data may be granted to any requester, or it is published with no restrictions. Public data is not considered sensitive. The integrity of Public data should be protected, and the appropriate Functional Security Module Representative must authorise replication or copying of the data in order to ensure it remains accurate over time. |
| Access | Only those individuals designated with approved access. | EMU employees and non-employees who have a business need to know. | EMU affiliates and general public with a need to know. |
Data may be classified as Restricted, Private, or Public by an entity. In this instance, public data are the least sensitive and have the lowest security requirements, whereas restricted data are the most sensitive and have the highest security rating. This form of data categorization is frequently the beginning point for many organisations, followed by subsequent identification and tagging operations that label data based on its enterprise-relatedness, quality, and other categories. The most effective data classification methods include follow-up processes and frameworks to ensure that sensitive data remains in its proper location.
Classifying data may be a difficult and laborious procedure. Automated systems can assist in streamlining the process, but an organisation must determine the categories and criteria that will be used to classify data, understand and define its objectives, outline the roles and responsibilities of employees in maintaining proper data classification protocols, and implement security standards that correspond with data categories and tags. This procedure will give an operational framework to workers and third parties engaged in the storage, transfer, or retrieval of data, if carried out appropriately.
Policies and procedures should be well-defined, respectful of security needs and the confidentiality of data types, and simple enough for staff to comprehend, thereby encouraging compliance. For example, each category should include information about the types of data included in the classification, security concerns including rules for accessing, transferring, and keeping data, and the potential risks associated with a security policy breach.
Steps for effective data classification
Understanding the current setup: Taking a comprehensive look at the location of the organisation’s current data and any applicable legislation is likely the best starting point for successfully classifying data. Before classifying data, one must know what data one has.
Creation of a data classification policy: Without adequate policy, maintaining compliance with data protection standards in an organisation is practically difficult. Priority number one should be the creation of a policy.
Prioritize and organize data: Now that a data classification policy is in place, it is time to categorise the data. The optimal method for tagging the data should be chosen based on its sensitivity and privacy.
Data organisation is the classification of unstructured data into distinct groups. This raw data comprises observations of variables. The arrangement of students’ grades in different subjects is one example of data organisation.
As time passes and the data volume grows, the time required to look for any information from the data source would rise if it has not previously been structured.
Data organisation is the process of arranging unstructured data in a meaningful manner. Classification, frequency distribution tables, image representations, graphical representations, etc. are examples of data organisation techniques.
Data organisation allows us to arrange data in a manner that is easy to understand and manipulate. It is challenging to deal with or analyse raw data.
IT workers utilise the notion of data organisation in several ways. Many of these are included under the umbrella term “data management.” For instance, data organisation includes reordering or assessing the arrangement of data components in a physical record.
The analysis of somewhat organised and unstructured data is another crucial component of business data organisation. Structured data consists of tabular information that may be readily imported into a database and then utilised by analytics software or other applications. Unstructured data is raw and unformatted, such as a basic text document with names, dates, and other information spread among random paragraphs. The integration of such largely unstructured data into a holistic data environment has been facilitated by the development of technical tools and resources.
In a world where data sets are among the most valuable assets possessed by firms across several industries, businesses employ data organisation methods in order to make better use of their data assets. Executives and other professionals may prioritise data organisation as part of a complete plan to expedite business operations, boost business intelligence, and enhance the business model as a whole.
Data distribution is a function that identifies and quantifies all potential values for a variable, as well as their relative frequency (probability of how often they occur). Any population with dispersed data is categorised as a distribution. It is necessary to establish the population’s distribution type in order to analyse it using the appropriate statistical procedures.
Statistics makes extensive use of data distributions. If an analyst gathers 500 data points on the shop floor, they are of little use to management unless they are categorised or organised in a usable manner. The data distribution approach arranges the raw data into graphical representations (such as histograms, box plots, and pie charts) and provides relevant information.
The primary benefit of data distribution is the estimation of the probability of any certain observation within a sample space. Probability distribution is a mathematical model that determines the probabilities of the occurrence of certain test or experiment outcomes. These models are used to specify distinct sorts of random variables (often discrete or continuous) in order to make a choice. One can employ mean, mode, range, probability, and other statistical approaches based on the category of the random variable.
Distributions are basically classified based on the type of data:
(i) Discrete distributions: A discrete distribution results from countable data and has a finite number of potential values. In addition, discrete distributions may be displayed in tables, and the values of the random variable can be counted. Example: rolling dice, obtaining a specific number of heads, etc. Following are the discrete distributions of various types:
(a) Binomial distributions: The binomial distribution quantifies the chance of obtaining a specific number of successes or failures each experiment.
Example: When tossing a coin, the likelihood of the coin falling on its head is one-half and the probability of the coin landing on its tail is one-half.
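As a small worked illustration of the binomial idea, the probability of exactly three heads in five fair tosses can be computed directly (pure Python, assumed example):

```python
# Minimal sketch: P(exactly 3 heads in 5 fair coin tosses) under the binomial distribution.
from math import comb

n, k, p = 5, 3, 0.5
probability = comb(n, k) * p**k * (1 - p)**(n - k)
print(probability)   # 0.3125
```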
(b) Poisson distribution: The Poisson distribution is the discrete probability distribution that quantifies the chance of a certain number of events occurring in a given time period, where the events occur in a well-defined order.
The Poisson distribution applies to attributes that can potentially take on huge values but in practice take on small ones.
Example: Number of flaws, mistakes, accidents, absentees etc.
(c) Hypergeometric distribution: The hypergeometric distribution is a discrete distribution that assesses the chance of a certain number of successes in (n) trials, without replacement, from a sufficiently large population (N). Specifically, sampling without replacement.
The hypergeometric distribution is comparable to the binomial distribution; the primary distinction between the two is that in the binomial distribution the chance of success is the same for every trial, whereas in the hypergeometric distribution it changes from trial to trial because sampling is done without replacement.
(d) Geometric distribution: The geometric distribution is a discrete distribution that assesses the probability of the occurrence of the first success. A possible extension is the negative binomial distribution.
Example: A marketing representative from an advertising firm chooses hockey players from several institutions at random till he discovers an Olympic participant.
(ii) Continuous distributions: A distribution with an unlimited number of (variable) data points that may be represented on a continuous measuring scale. A continuous random variable is a random variable with an unlimited and uncountable set of potential values. It is more than a simple count and is often described using probability density functions (pdf). The probability density function describes the characteristics of a random variable. Normally clustered frequency distribution is seen. Therefore, the probability density function views it as the distribution’s “shape.”
Following are the continuous distributions of various types:
(i) Normal distribution: Gaussian distribution is another name for normal distribution. It is a bell-shaped curve with a greater frequency (probability density) around the core point. As values go away from the centre value on each side, the frequency drops dramatically.
In other words, features whose dimensions are expected to fall on either side of the target value with equal likelihood adhere to normal distribution.
(ii) Lognormal distribution: A continuous random variable x follows a lognormal distribution if the distribution of its natural logarithm, ln(x), is normal.
As the sample size rises, the distribution of the sum of random variables approaches a normal distribution, independent of the distribution of the individuals.
(iii) F distribution: The F distribution is often employed to examine the equality of variances between two normal populations.
The F distribution is an asymmetric distribution with no maximum value and a minimum value of 0. The curve approaches 0 but never reaches the horizontal axis.
(iv) Chi square distributions: When independent variables with standard normal distribution are squared and added, the chi square distribution occurs.
Example: y = Z₁² + Z₂² + Z₃² + Z₄² + ... + Zₙ², where each Zᵢ is a standard normal random variable.
The distribution of chi-square values is bounded below at zero and skewed to the right, and it approaches the shape of the normal distribution as the number of degrees of freedom grows.
(v) Exponential distribution: The exponential distribution is a probability distribution and one of the most often employed continuous distributions. Used frequently to represent products with a consistent failure rate.
The exponential distribution and the Poisson distribution are closely connected. The exponential distribution has a constant failure rate since its shape characteristics remain constant.
(vi) T student distribution: The t distribution or student’s t distribution is a probability distribution with a bell shape that is symmetrical about its mean.
Used frequently for testing hypotheses and building confidence intervals for means. Substituted for the normal distribution when the standard deviation cannot be determined.
When random variables are averages, the distribution of the average tends to be normal, similar to the normal distribution, independent of the distribution of the individuals.
Data Cleaning
Data cleaning is the process of correcting or deleting inaccurate, corrupted, improperly formatted, duplicate, or insufficient data from a dataset. When several data sources are combined, there are numerous chances for data duplication and mislabelling. Incorrect data renders outcomes and algorithms untrustworthy, despite their apparent accuracy. There is no one definitive method for prescribing the precise phases of the data cleaning procedure, as the methods differ from dataset to dataset. However, it is essential to build a template for your data cleaning process so that you can be certain you are always doing the steps correctly.
Data cleaning is different from data transformation. Data cleaning is the process of removing irrelevant data from a dataset. The process of changing data from one format or structure to another is known as data transformation. Transformation procedures are sometimes known as data wrangling or data munging, since they map and change “raw” data into another format for warehousing and analysis.
Steps for data cleaning
(i) Step 1: Removal of duplicate and irrelevant information
Eliminate unnecessary observations from your dataset, such as duplicate or irrelevant observations. Most duplicate observations will occur during data collection. When you merge data sets from numerous sources, scrape data, or receive data from customers or several departments, there are opportunities to produce duplicate data. De-duplication is one of the most important considerations in this procedure. Observations are deemed irrelevant when they do not pertain to the specific topic you are attempting to study. For instance, if you wish to study data pertaining to millennial clients but your dataset contains observations pertaining to earlier generations, you might exclude these irrelevant observations. This may make analysis more effective and reduce distractions from your core objective, in addition to producing a more manageable and effective dataset.
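A brief sketch of this step with pandas; the data frame, the column names and the millennial birth-year range are assumptions for illustration.

```python
# Minimal sketch (assumed data): drop exact duplicates and rows irrelevant
# to a study of millennial customers.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "birth_year":  [1985, 1992, 1992, 1958, 1996],
})

cleaned = customers.drop_duplicates()                          # remove duplicate rows
cleaned = cleaned[cleaned["birth_year"].between(1981, 1996)]   # keep millennials only
print(cleaned)
```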
(ii) Step 2: Fix structural errors:
When measuring or transferring data, you may detect unusual naming standards, typos, or wrong capitalization. These contradictions may lead to mislabeled classes or groups. For instance, “N/A” and “Not Applicable” may both be present, but they should be examined as a single category.
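A short sketch of fixing such structural inconsistencies; the column of survey responses and the label variants are assumptions.

```python
# Minimal sketch: collapse inconsistent spellings and capitalisation into
# a single canonical category.
import pandas as pd

responses = pd.Series(["N/A", "Not Applicable", "yes", "Yes", "NO", "n/a"])

cleaned = (responses.str.strip()
                    .str.lower()
                    .replace({"not applicable": "n/a"}))
print(cleaned.value_counts())   # "n/a", "yes" and "no" are each a single class
```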
(iii) Step 3: Filter unwanted outliers:
Occasionally, you will encounter observations that, at first look, do not appear to fit within the data you are evaluating. If you have a valid cause to eliminate an outlier, such as erroneous data input, doing so will improve the performance of the data you are analysing. Occasionally, though, the appearance of an outlier will prove a notion you are working on. Remember that the existence of an outlier does not imply that it is erroneous; this step is required to validate the number. Consider deleting an outlier only if it appears to be unrelated to the analysis or an error.
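One common screening rule, the 1.5 × IQR rule, is sketched below with invented data; as noted above, whether to delete a flagged point remains a judgement call.

```python
# Minimal sketch: flag values lying outside 1.5 * IQR of the quartiles.
import pandas as pd

amounts = pd.Series([52, 49, 55, 61, 47, 53, 50, 940])   # 940 looks like a data-entry error

q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1
outliers = amounts[(amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)]
print(outliers)   # review these before deciding whether to delete them
```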
(iv) Step 4: Handle missing data
Many algorithms do not accept missing values, hence missing data cannot be ignored. There are several approaches to handling missing data. Although neither of the two alternatives below is ideal, both should be explored.
As a first alternative, the observations with missing values may be dropped, but doing so may result in the loss of information. This should be kept in mind before doing so.
As a second alternative, the missing numbers may be entered based on other observations. Again, there is a chance that the data’s integrity may be compromised, as action may be based on assumptions rather than real observations.
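A compact sketch of the two alternatives described above, using pandas; the loan table and the choice of column medians for imputation are assumptions.

```python
# Minimal sketch: either drop rows with missing values or impute them from
# other observations (here, the column medians).
import numpy as np
import pandas as pd

loans = pd.DataFrame({"income": [35000, np.nan, 41000, 52000],
                      "tenure": [12, 24, np.nan, 36]})

dropped = loans.dropna()                                   # option 1: lose incomplete rows
imputed = loans.fillna(loans.median(numeric_only=True))    # option 2: fill with medians
print(dropped, imputed, sep="\n\n")
```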
(v) Step 5: Validation and QA
As part of basic validation, one should be able to answer the following questions at the conclusion of the data cleaning process:
(a) Does the data make sense?
(b) Does the data adhere to the regulations applicable to its field?
(c) Does it verify or contradict your working hypothesis, or does it shed any light on it?
(d) Can data patterns assist you in formulating your next theory?
(e) If not, is this due to an issue with data quality?
False assumptions based on inaccurate or “dirty” data can lead to ineffective company strategies and decisions. False conclusions might result in an uncomfortable moment at a reporting meeting when it is shown that the data does not withstand inspection. Before reaching that point, it is essential to establish a culture of data quality inside the firm. To do this, one should specify the methods that may be employed to establish this culture and also the definition of data quality.
Benefits of quality data
Determining the quality of data needs an analysis of its properties and a weighting of those attributes based on what is most essential to the company and the application(s) for which the data will be utilised.
Main characteristics of quality data are:
(i) Validity
(ii) Accuracy
(iii) Completeness
(iv) Consistency
Benefits of data cleaning
Ultimately, having clean data will boost overall productivity and provide the highest quality information for decision-making. Benefits include:
(i) Error correction when numerous data sources are involved.
(ii) Fewer mistakes result in happier customers and less irritated workers.
(iii) Capability to map the many functions and planned uses of your data.
(iv) Monitoring mistakes and improving reporting to determine where errors are originating can make it easier to repair inaccurate or damaged data in future applications.
(v) Using data cleaning technologies will result in more effective corporate procedures and speedier decision-making.
Data validation
Data validation is a crucial component of any data management process, whether it is about collecting information in the field, evaluating data, or preparing to deliver data to stakeholders. If the initial data is not valid, the outcomes will not be accurate either. It is therefore vital to check and validate data before using it.
Although data validation is an essential stage in every data pipeline, it is frequently ignored. It may appear that data validation is an unnecessary step that slows down the work, but it is vital for producing the finest possible outcomes. Today, data validation may be accomplished considerably more quickly than one might have imagined earlier. With data integration systems that can include and automate validation procedures, validation may be considered an integral part of the workflow, as opposed to an additional step.
Validating the precision, clarity, and specificity of data is essential for mitigating project failures. Without data validation, one runs the danger of basing judgements on faulty data that is not indicative of the current situation.
In addition to validating data inputs and values, it is vital to validate the data model itself. If the data model is not appropriately constructed or developed, one may encounter problems while attempting to use data files in various programmes and software.
The format and content of data files will determine what can be done with the data. Using validation criteria to purify data prior to usage mitigates “garbage in, garbage out” problems. Ensuring data integrity contributes to the validity of the conclusions.
Types of data validation
~ Data type check: A data type check verifies that the entered data has the appropriate data type. For instance, a field may only take numeric values. If this is the case, the system should reject any data containing other characters, such as letters or special symbols.
~ Code check: A code check verifies that a field’s value is picked from a legitimate set of options or that it adheres to specific formatting requirements. For instance, it is easy to verify the validity of a postal code by comparing it to a list of valid codes. The same principle may be extended to other things, including nation codes and NIC industry codes.
~ Range check: A range check determines whether or not input data falls inside a specified range. Latitude and longitude, for instance, are frequently employed in geographic data. A latitude value must fall between -90 and 90 degrees, whereas a longitude value must fall between -180 and 180 degrees. Outside of this range, values are invalid.
~ Format check: Numerous data kinds adhere to a set format. Date columns that are kept in a fixed format, such as “YYYY-MM-DD” or “DD-MM-YYYY,” are a popular use case. A data validation technique that ensures dates are in the correct format contributes to data and temporal consistency.
~ Consistency check: A consistency check is a form of logical check that verifies that the data has been input in a consistent manner. Checking whether a package’s delivery date is later than its shipment date is one example.
~ Uniqueness check: Some data like PAN or e-mail ids are unique by nature. These fields should typically contain unique items in a database. A uniqueness check guarantees that an item is not put into a database numerous times.
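Several of the checks listed above can be combined in a small validation routine; the field names, the PAN-style format and the rules below are assumptions for illustration.

```python
# Minimal sketch: format, range, consistency and uniqueness checks on two records.
import re
from datetime import datetime

records = [
    {"pan": "ABCDE1234F", "latitude": 19.07, "ship_date": "2023-04-01", "delivery_date": "2023-04-05"},
    {"pan": "ABCDE1234F", "latitude": 95.00, "ship_date": "2023-04-10", "delivery_date": "2023-04-08"},
]

seen_pans = set()
for i, record in enumerate(records):
    errors = []
    if not re.fullmatch(r"[A-Z]{5}[0-9]{4}[A-Z]", record["pan"]):
        errors.append("format: PAN")
    if not -90 <= record["latitude"] <= 90:
        errors.append("range: latitude")
    shipped = datetime.strptime(record["ship_date"], "%Y-%m-%d")       # format check on dates
    delivered = datetime.strptime(record["delivery_date"], "%Y-%m-%d")
    if delivered < shipped:
        errors.append("consistency: delivered before shipped")
    if record["pan"] in seen_pans:
        errors.append("uniqueness: duplicate PAN")
    seen_pans.add(record["pan"])
    print(f"record {i}: {errors or 'valid'}")
```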
Consider the case of a business that collects data on its outlets but neglects to do an appropriate postal code verification. The error might make it more challenging to utilise the data for information and business analytics. Several issues may arise if the postal code is not supplied or is typed incorrectly.
In certain mapping tools, defining the location of the shop might be challenging. A store’s postal code will also facilitate the generation of neighborhood-specific data. Without a postal code data verification, it is more probable that data may lose its value. If the data needs to be recollected or the postal code needs to be manually input, further expenses will also be incurred.
A straightforward solution to the issue would be to provide a check that guarantees a valid postal code is entered. The solution may be a drop-down menu or an auto-complete form that enables the user to select a valid postal code from a list. This kind of data validation is referred to as a code validation or code check.
Solved Case 1
Maitreyee is working as a data analyst with a financial organisation. She is supplied with a large amount of data, and she plans to use statistical techniques for inferring some useful information and knowledge from it. But, before starting the process of data analysis, she found that the provided data is not cleaned. She knows that before applying the data analysis tools, cleaning the data is essential.
In your opinion, what steps should Maitreyee follow to clean the data, and what are the benefits of clean data?
Teaching note - outline for solution:
The instructor may initiate the discussion by explaining the concept of data cleaning and its importance.
The instructor may also elaborate on the consequences of using an uncleaned dataset on the final analysis. She may discuss the five steps of data cleaning in detail, such as,
(i) Removal of duplicate and irrelevant information
(ii) Fix structural errors
(iii) Filter unwanted outliers
(iv) Handle missing data
(v) Validation and QA
At the outset, Maitreyee should focus on answering the following questions:
(a) Does the data make sense?
(b) Does the data adhere to the regulations applicable to its field?
(c) Does it verify or contradict your working hypothesis, or does it shed any light on it?
(d) Can data patterns assist you in formulating your next theory?
(e) If not, is this due to an issue with data quality?
The instructor may close the discussion by explaining the benefits of using clean data, such as:
(i) Validity
(ii) Accuracy
(iii) Completeness
(iv) Consistency
1. Data science plays an important role in
(a) Risk analytics
(b) Customer data management
(c) Consumer analytics
(d) All of the above
Answer:- d. All of the above
Choice "D" is correct as Data science plays an important role in Risk analytics, Customer data management & Consumer analytics.
Data science is a vital tool in the fields of risk analytics, customer data management, and consumer analytics.
In risk analytics, data science is used to identify potential risks and to develop strategies for managing them. By analyzing large amounts of data, data scientists can identify patterns and trends that may signal impending risks, and develop models to predict future outcomes. For example, financial institutions use data science to predict credit risk and to identify potential fraudulent activity.
Overall, data science plays a crucial role in these fields by enabling organizations to analyze large amounts of data and develop insights that can inform strategic decision-making.
2. The primary benefit of data distribution is
(a) the estimation of the probability of any certain observation within a sample space
(b) the estimation of the probability of any certain observation within a non-sample space
(c) the estimation of the probability of any certain observation within a population
(d) the estimation of the probability of any certain observation without a non-sample space
Answer:- a. the estimation of the probability of any observation within a sample space
3. Binomial distribution applies to attributes
(a) that are categorised into two mutually exclusive and exhaustive classes
(b) that are categorised into three mutually exclusive and exhaustive classes
(c) that are categorised into less than two mutually exclusive and exhaustive classes
(d) that are categorised into four mutually exclusive and exhaustive classes
Answer:- a. that are categorised into two mutually exclusive and exhaustive classes
Choice "A" is correct as
binomial distribution applies to attributes that are categorized into two mutually exclusive and exhaustive classes.
The binomial distribution is a probability distribution that is commonly used when we have a binary outcome, where an event can either occur or not occur. The binomial distribution applies to attributes that are categorized into two mutually exclusive and exhaustive classes, and it is a useful tool for calculating the probability of getting a certain number of successes in a fixed number of trials.
For example, if we are flipping a coin and want to know the probability of getting 3 heads in 5 flips, we can use the binomial distribution to calculate this probability.
4. The geometric distribution is a discrete distribution that assesses
(a) the probability of the occurrence of the first success
(b) the probability of the occurrence of the second success
(c) the probability of the occurrence of the third success
(d) the probability of the occurrence of the less success
Answer:- a. the probability of the occurrence of the first success
5. The probability density function describes
(a) the characteristics of a random variable
(b) the characteristics of a non-random variable
(c) the characteristics of a random constant
(d) the characteristics of a non-random constant
Answer:- a. the characteristics of a random variable
Choice "A" is correct as the probability density function describes the characteristics of a random variable.
Continuous distributions: A distribution with an unlimited number of (variable) data points that may be represented on a continuous measuring scale. A continuous random variable is a random variable with an unlimited and uncountable set of potential values. It is more than a simple count and is often described using probability density functions (pdf). The probability density function describes the characteristics of a random variable. Normally clustered frequency distribution is seen. Therefore, the probability density function views it as the distribution’s “shape.”
1. Data validation could be operationally defined as a process which ensures the correspondence of the final (published) data with a number of quality characteristics. Answer:- T
2. Data analysis is described as the process of cleaning, converting, and modelling data to obtain actionable business intelligence. Answer:- T
3. Financial data such as revenues, accounts receivable, and net profits are often summarised in a company’s data reporting. Answer:- T
4. Structured data consists of tabular information that may be readily imported into a database and then utilised by analytics software or other applications. Answer:- T
5. Data distribution is a function that identifies and quantifies all potential values for a variable, as well as their relative frequency (probability of how often they occur). Answer:- T
1. Data may be classified as Restricted, Private, or Public by an entity.
2. Data organisation is the classification of unstructured data into distinct groups.
3. Classification, frequency distribution tables, image representations, graphical representations, etc. are examples of data organisation techniques.
4. The t distribution or student's t distribution is a probability distribution with a bell shape that is symmetrical about its mean.
5. Data cleaning is the process of correcting or deleting inaccurate, corrupted, improperly formatted, duplicate, or insufficient data from a dataset.
1. Briefly discuss the role of data analysis in fraud detection.
Answer:-
2. Discuss the difference between discrete distribution and continuous distribution?
Answer:-
3. Write a short note on binomial distribution?
Answer:-
4. What is the significance of data cleaning?
Answer:-
5. Write a short note on ‘predictive analytics’.
Answer:-
1. Elaborately discuss the functions of data analysis?
Answer:-
2. Elaborately discuss the various steps involved in data cleaning?
Answer:-
3. Discuss the benefits of data cleaning.
Answer:-
4. How are data processing and data science relevant to finance?
Answer:-
5. Discuss the steps for effective data classification?
Answer:-
Unsolved Case
References:
● Davy Cielen, Arno D. B. Meysman, and Mohamed Ali. Introducing Data Science. Manning Publications Co., USA
● Cathy O'Neil and Rachel Schutt. Doing Data Science. O'Reilly
● Joel Grus. Data Science from Scratch. O'Reilly
● www.tableau.com
● www.corporatefinanceinstitute.com
● Tyler McClain. Data analysis and reporting. Orgsync
● Marco Di Zio, Nadežda Fursova, Tjalling Gelsema, Sarah Gießing, Ugo Guarnera, Jūratė Petrauskienė, Lucas Quensel-von Kalben, Mauro Scanu, K.O. ten Bosch, Mark van der Loo, Katrin Walsdorfer. Methodology of data validation
● Barbara S Hawkins, and Stephen W Singer. Design, development and implementation of data processing system for multiple control trials and epidemiologic studies. Controlled clinical trials (1986)