Are you a leader or a laggard or an aspirant in AI?
This is a subject close to my heart because I focus my teaching, research, and consulting on the leader end of AI, where the competitive advantages create exponential gains for companies and people.
Here are the key takeaways from this article that resonate with me
If you want to be an AI leader, you should be paying attention to this.
Also, if you work for a company, try to find out whether it aspires to be a leader or an also-ran.
I think in a decade, just like we are seeing with the retail industry, many of these also-rans who do not invest in AI for competitive advantage will not exist.
Where many companies tire of marginal gains from early AI efforts, the most successful recognize that the real breakthroughs in AI learning and scale come from persisting through the arduous phases.
Key lessons from AI leaders: Fund aggressively when conditions for success are in place; Build density in domains; Bring a rounded set of skills and invest in productivity; Speed execution with iterative releases; Win the front line
Many organizations underestimate what it takes to sow true gains, be it selecting the right seeds, apportioning the right investment, or having a mindset willing to put up with the vagaries of the crop cycle.
But for those that persevere, the rewards can be huge.
McKinsey research finds that leading organizations that approach the AI journey in the right ways and stick with it through the tough patches generate three to four times higher returns from their investments.
These AI leaders get on a different performance trajectory from the outset because they understand that AI is about mastering the long haul.
They prepare for that journey by anticipating the types of things that will make it easier to navigate the ups and downs, such as feedback loops that allow data quality and user adoption to compound and AI investments to become self-boosting.
Where some companies tire of marginal gains from weeks of effort, leaders recognize that the real breakthroughs in AI learning and scale come from working through those small steps.
But only a small number of businesses (10%) have figured out how to make AI work in these ways. The rest remain mired in the low to middling stages of maturity, with laggards making up 60 percent of the population and aspirants 30 percent
Top performers recognize that most of the impact comes from the last 20% of the journey
Leaders get disproportional impact from their AI investments.
The window of opportunity for underperformers is
Rather than dabbling in lots of different areas, they build strength and density in one or two domains, then expand from there. That approach allows them to deepen their use and application of unstructured data, access more sophisticated use cases, and layer in the necessary operational underpinnings—the investment, talent, data management, production, and other techniques that allow AI-enabled practices to become embedded into everyday routines
Moreover, as leaders build domain strength and reach a certain threshold in AI performance, their rate of learning and productivity increases, allowing them to progress through other domains faster and tackle problems of ever-increasing difficulty.
They recognize that scaling AI solutions to deal with increasingly sophisticated problems is hard, but necessary to capture value. Teaching a machine to identify human faces is one milestone, for instance, but getting the machine to recognize particular faces and only those faces is a far more complex undertaking.
Once solved, companies gain compounding benefits quickly.
While some of this can be seen as consultants prodding companies to action, to many of us none of this is new. We have already seen how companies like Amazon are reaping exponential gains from their investment in technology and AI.
There is an early-mover advantage in AI, and companies that aspire to a leadership position will gain exponential benefits compared with the laggards and the aspirants.
This blog is the second part of a two-blog series. Here, we discuss different sectors where reinforcement learning can be used to solve complex problems efficiently. The blog is based on this paper. In the previous post, we studied the basics of reinforcement learning and how one can think of a problem as a reinforcement learning problem. In this follow-on post, we look at how real-world reinforcement learning applications can be developed.
In general, to formulate any reinforcement learning problem, we need to define the environment, the agent, states, actions, and rewards. This idea forms the basis for the examples in this post.
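To make those five ingredients concrete, here is a minimal sketch of the agent-environment loop. The `ThermostatEnv` class, its reward, and the rule-based `simple_agent` are hypothetical names invented for illustration; the agent here follows a fixed rule and does not learn, so this only shows how state, action, and reward flow between the two sides.

```python
class ThermostatEnv:
    """Hypothetical toy environment: keep a room near a target temperature."""

    def __init__(self, target=21, start=25):
        self.target = target
        self.state = start  # state: the current temperature

    def step(self, action):
        # action: -1 (cool), 0 (do nothing), +1 (heat)
        self.state += action
        reward = -abs(self.state - self.target)  # closer to target is better
        return self.state, reward


def simple_agent(state, target=21):
    """Hypothetical rule-based agent: move the temperature toward the target."""
    if state > target:
        return -1
    if state < target:
        return 1
    return 0


env = ThermostatEnv()
state = env.state
total_reward = 0
for _ in range(10):
    action = simple_agent(state)      # the agent picks an action from the state
    state, reward = env.step(action)  # the environment returns new state and reward
    total_reward += reward
```

A learning agent would replace `simple_agent` with a policy that is updated from the observed rewards.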
We cover reinforcement learning for –
Recommender systems
Energy management
Finance
Transportation
Healthcare
Recommender systems
Algorithms for recommendation systems are constantly evolving, and reinforcement learning techniques play a key part in them. Recommender systems face some unique challenges that reinforcement learning can address:
The idiosyncratic nature of actions
A high degree of unobservability
Stochasticity in preferences, activities, personality, etc.
Horizon is Facebook’s open source applied reinforcement learning platform for recommendations. We formulate the illustrations in the figure below in terms of reinforcement learning as –
The action is sending or dropping a notification.
The agent is the recommender system.
The rewards are the interactions and activities on Facebook.
The environment is the user and the news.
The state space is the ongoing interaction and engagement of the user with the news, along with the features representing the candidate to be notified.
Energy management
Cooling is an essential process in data centers, needed to lower high temperatures while conserving energy, and reinforcement learning can be used efficiently for it. Typically, MPC (model-predictive control) is used to monitor and regulate the temperature and airflow through components of the data center such as fan speeds, water flow regulators, and air handling units (AHUs). This can be formulated as a reinforcement learning problem as follows –
The agent is the controller. The agent learns a linear model of the data center operations through random, safe exploration, with little or no prior knowledge.
The manipulated variables (ex: fan speed to control airflow, valve opening, etc.) represent the controls or actions.
The reward is the cost of a trajectory.
The process variables to predict and regulate (differential air pressure, cold-aisle temperature, entering air temperature to each AHU, leaving air temperature (LAT) from each AHU, etc.) represent the state.
The agent optimizes the cost (reward) of a trajectory based on the predicted model and generates actions at each step, which mitigates the effect of model error and caters for unexpected disturbances.
For reinforcement learning purposes, the data center is modelled as a control loop for cooling processes. The figure below illustrates the process.
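The control loop described above can be sketched roughly as follows. Everything in this example is an assumption made for illustration: the `plant` function stands in for the real data center, the coefficients and ranges are invented, and the controller looks only one step ahead rather than over a full trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)


def plant(temp, fan_speed):
    """Hypothetical ground truth: next_temp = 0.9*temp - 0.5*fan_speed + 3.0."""
    return 0.9 * temp - 0.5 * fan_speed + 3.0


# 1) Safe random exploration: try fan speeds only within a restricted range.
temps = rng.uniform(20.0, 30.0, size=200)
fans = rng.uniform(2.0, 8.0, size=200)
next_temps = plant(temps, fans)

# 2) Learn a linear model of the dynamics by least squares.
X = np.column_stack([temps, fans, np.ones_like(temps)])
coef, *_ = np.linalg.lstsq(X, next_temps, rcond=None)


# 3) One-step control: pick the fan speed whose predicted cost is lowest.
def choose_fan_speed(temp, target=24.0, energy_weight=0.01):
    candidates = np.linspace(2.0, 8.0, 61)
    features = np.column_stack(
        [np.full_like(candidates, temp), candidates, np.ones_like(candidates)]
    )
    predicted = features @ coef
    cost = (predicted - target) ** 2 + energy_weight * candidates ** 2
    return candidates[np.argmin(cost)]


best = choose_fan_speed(26.0)
```

Re-planning at every step from the latest measurement is what lets such a controller absorb model error and unexpected disturbances.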
Finance
I have done a good amount of work in the financial sector and, drawing on that domain knowledge, I think many problems in finance can be modelled as sequential decision problems. Reinforcement learning can be employed for some of these, including option pricing, portfolio optimization, and risk management.
In the case of option pricing, the challenge lies in determining the right price for the option. To formulate option pricing as a reinforcement learning problem, we again define states, actions, and rewards as below –
The uncertainties affecting the option price are captured as part of the state. These include financial and economic factors such as interest rates.
The action could be the act of exercising the option.
The reward is the intrinsic value of the option resulting from the change in state.
When we model option pricing as a reinforcement learning problem, the entire training process depends on learning the state-action value function.
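As a rough illustration of what a state-action value function means here, the sketch below computes the values of "exercise" and "hold" for an American put on a small binomial tree. The tree, its parameters, and the function name are hypothetical, and the values are computed exactly by backward induction on a known toy model; a reinforcement learning method would instead estimate them from sampled price paths.

```python
def american_put_q_values(s0=100.0, strike=100.0, up=1.1, down=0.9,
                          rate=0.0, steps=3):
    """Return (Q(exercise), Q(hold)) at the root of a toy binomial tree."""
    p = (1 + rate - down) / (up - down)  # risk-neutral probability of an up move
    discount = 1.0 / (1 + rate)

    # At maturity the option is worth its intrinsic value.
    values = [max(strike - s0 * up**j * down**(steps - j), 0.0)
              for j in range(steps + 1)]

    q_exercise = q_hold = None
    for t in range(steps - 1, -1, -1):
        new_values = []
        for j in range(t + 1):
            price = s0 * up**j * down**(t - j)
            q_exercise = max(strike - price, 0.0)  # reward for exercising now
            q_hold = discount * (p * values[j + 1] + (1 - p) * values[j])
            new_values.append(max(q_exercise, q_hold))
        values = new_values
    # The last iteration (t = 0) has a single node, the root.
    return q_exercise, q_hold


q_exercise, q_hold = american_put_q_values()
```

The option price at the root is the larger of the two values, and the greedy policy picks whichever action attains it.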
Transportation
In the transportation sector, reinforcement learning aims to improve efficiency and reduce cost. Order dispatching in ridesharing systems (for example, Uber) is one of the best applications of RL in transportation. Allocating a driver to a passenger is a complex process that depends on various factors such as demand prediction, route planning, and fleet management. The order dispatching problem has both spatial and temporal components. It could be formulated as a reinforcement learning problem where:
The state is composed of a driver’s geographical status, the raw timestamp, and the contextual feature vector (ex: driver service statistics, holiday indicators).
An option represents a state change for a driver over multiple time steps.
A policy represents the probability of taking an option in a state. The RL algorithm aims to estimate an optimal policy and/or its value function.
Composition and workflow of the order dispatching simulator. (from Tang et al. (2019))
The model is initialized using historical data. After that, the process is driven by an order dispatch policy learned with reinforcement learning.
Healthcare
Healthcare is one of the most crucial sectors, with many opportunities and challenges for AI where reinforcement learning could be used. We will discuss some of these below –
Dynamic treatment strategies (DTRs) – A DTR is a treatment process comprising a sequence of decision rules that determine how the patient’s ongoing treatment should evolve based on the current state and the covariate history (a covariate is any continuous variable that is expected to correlate with the outcome variable). DTRs apply to personalized treatment plans, typically for chronic conditions.
In the case of DTRs, we could consider –
The state is a multidimensional discrete-time series of variables of interest for the treatment (demographics, vital signs, etc.). We can use clustering to determine the state space so that patients in the same cluster are similar in their observable properties.
An action comprises the medical treatment the patient receives, administered as doses of medicine over a sequence of time.
The reward is directly the patient’s health, indicating whether it improves or deteriorates.
Try one for yourself?
Another healthcare application for reinforcement learning is the generation of reports from medical images. A medical report comprises specific segments such as the findings, the report’s conclusion (main finding and diagnosis), any secondary information, etc. For this case, I leave it to the reader to formulate the problem as a reinforcement learning problem.
Hint – In this scenario, first a CNN (convolutional neural network) is used to extract visual features from a set of images and transform them into a context vector. From this context vector, a sentence decoder generates latent topics recurrently. Based on a latent topic, a retrieval policy module generates sentences using either a generation approach or a template. The RL-based retrieval policy integrates prior human knowledge.
Hope you enjoyed reading the blog! For any questions or doubts, please drop a comment.
About Me (Kajal Singh)
Kajal Singh is a Data Scientist and a Tutor on the Artificial Intelligence – Cloud and Edge implementations course at the University of Oxford. She is also the co-author of the book “Applications of Reinforcement Learning to Real-World Data: An educational introduction to the fundamentals of Reinforcement Learning with practical examples on real data (2021)”.
Hello! I would like to share an interesting experience from my time as a junior Data Scientist, when I was still very much a beginner in the data science domain.
One of our customers came to us with a machine learning problem to be solved in both supervised and unsupervised forms. I expected the usual mode of execution: in small-scale implementations and during my training period we used to have 25-30 features, play around with them, predict, classify, or cluster the dataset, and share the outcome.
But this time they came with thousands of features. I was surprised and a little scared about the implementation, and my head started spinning. At the same time, my Senior Data Scientist brought everyone on the team into the meeting room.
My Senior Data Scientist (Sr. DS) introduced a new term to us: Dimensionality Reduction (or Dimension Reduction), the remedy for the Curse of Dimensionality. All of us beginners thought he was going to explain something from Physics, though we vaguely remembered coming across the term during our training programme. Then he started to sketch on the board (refer to fig-1). Looking at 1-D and 2-D we were quite comfortable, but at 3-D and above our heads started to spin.
1-D and 2-D
3 – D
The Sr. DS continued his lecture: these sample pictures have just a few features that we can play around with, but in real-world scenarios many Machine Learning (ML) problems involve thousands of features. Training such models becomes extremely slow, the models do not give good solutions to the business problem, and we cannot finalize a model. This situation is the so-called “Curse of Dimensionality” at work. We all started asking how we should handle this.
He took a long breath and continued to share his experience in his own style. He started with a simple definition as follows.
What is Dimensionality?
We can say the number of features in our dataset is referred to as its dimensionality.
What is Dimensionality Reduction?
Dimensionality Reduction is the process of reducing the dimensions (features) of a given dataset. Say your dataset has a hundred columns/features and you bring the number of columns down to 20-25. In simple terms, you are converting a Cylinder/Sphere to a Circle, or a Cube into a Plane in two-dimensional space, as in the figure below.
Converting 3D to 2D
He clearly drew the relationship between Model Performance and Number of Features (Dimensions). As the number of features increases, the number of data points needed also increases, and it grows far faster than proportionally: to represent all combinations of features and their values, every extra feature multiplies the samples required.
Model Performance vs Number of Features
Now everyone in the room had a feel for what the “Curse of Dimensionality” is at a very high level.
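A quick back-of-the-envelope sketch shows how fast the feature space grows. It assumes, purely for illustration, that every feature is split into 10 equal bins and that we want at least one sample per combination; the function name is made up for this example.

```python
def cells_to_cover(bins_per_feature, n_features):
    """Number of grid cells when each feature is split into equal bins."""
    return bins_per_feature ** n_features


for d in (1, 2, 3, 10):
    print(d, "features ->", cells_to_cover(10, d), "cells")
```

Going from 3 to 10 features takes the grid from a thousand cells to ten billion, which is why a dataset that looked dense in a few dimensions becomes sparse in many.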
Benefits of doing Dimensionality Reduction
Suddenly, one of the team members asked whether he could tell us the benefits of doing dimensionality reduction on a given dataset.
Our Sr. DS kept sharing his extensive knowledge and continued as below.
There are lots of benefits if we go with dimensionality reduction.
It helps remove redundant features and noise, which ultimately makes the dataset easier to visualize.
It reduces memory and storage requirements.
It improves model performance by keeping the right features and removing unnecessary ones from the dataset.
Fewer dimensions (only the necessary ones) require less computation, so the model trains faster, often with improved accuracy.
It considerably reduces the complexity of the overall model and the risk of overfitting.
Yes! It was awe-inspiring to see the robustness and dynamics of “Dimensionality Reduction”. Now I can visualize the overall benefits as below. I hope it helps you too.
Benefits of Dimensionality Reduction.
What is next? Of course, we jumped to the next major question: what techniques are available for Dimensionality Reduction?
Dimensionality Reduction – Techniques
Our Sr. DS, very much interested, continued his explanation of the techniques available in the Data Science domain. They are broadly classified into two approaches: selecting the best-fit features (removing the less important ones), or deriving new features from the given high-dimensional dataset. These high-level techniques are called Feature Selection and Feature Extraction, and both are part of Feature Engineering. He connected the dots perfectly.
Locating Dimensionality Reduction in Feature Engineering family
He took us deeper into the concepts to understand the big picture of applying “Dimensionality Reduction” to a high-dimensional dataset. Once we saw the figure below, we were able to relate Feature Engineering and Dimensionality Reduction. Look at this figure: the essence of Dimensionality Reduction, as explained by our Sr. DS, is in it!
Everyone was interested to know how to apply all these using Python libraries with the help of simple coding. Our Sr. DS asked me to bring colorful markers and dusters.
The Sr. DS picked up a new blue marker and started explaining PCA with a simple example. Before that, he explained what PCA is in the context of dimensionality reduction.
Principal Component Analysis (PCA): PCA is a technique for reducing the dimensionality of a given dataset, increasing interpretability with minimal information loss. The number of variables decreases, which makes further analysis simpler. It converts a set of correlated variables into a set of uncorrelated variables, and it is often used as a step in machine learning predictive modeling. He also advised us to read up on eigenvectors and eigenvalues.
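To see the "correlated in, uncorrelated out" idea in action, here is a small sketch using plain NumPy and the eigenvectors of the covariance matrix. The data is synthetic and the variable names are invented for this example; it is not the wine dataset used later in this post.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: two strongly correlated features (purely illustrative).
x = rng.normal(size=500)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.5, size=500)])

# 1) Center the data, 2) eigen-decompose the covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort components by decreasing variance (eigh returns ascending order).
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 3) Project the data onto the principal components.
projected = centered @ eigenvectors

corr_before = np.corrcoef(data, rowvar=False)[0, 1]
corr_after = np.corrcoef(projected, rowvar=False)[0, 1]
print(round(corr_before, 3), round(abs(corr_after), 3))
```

Keeping only the first column of `projected` would be the dimensionality reduction step: most of the variance survives in a single variable.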
He took the familiar wine quality dataset (winequality.csv) for his quick analysis.
# Import all the necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import metrics
# Jupyter magic: render plots inside the notebook
%matplotlib inline
# Load the wine quality data
wq_dataset = pd.read_csv('winequality.csv')
EDA on a given data set
wq_dataset.head(5)
wq_dataset.describe()
wq_dataset.isnull().any()
There are no null values in the given data set, so we’re lucky.
We were all surprised when we looked at the two columns above, with their new column names and values. We asked what had happened to the ‘fixed acidity’, ‘volatile acidity’, ‘citric acid’, ‘chlorides’, ‘total sulfur dioxide’, ‘density’, ‘sulphates’, and ‘alcohol’ columns. The Sr. DS said they were all gone: we now have just two columns after applying PCA for dimensionality reduction to the data, and from here we implement a few models in the normal way.
He mentioned one key phrase, “variation per principal component”: the fraction of variance explained by a principal component is the ratio between the variance of that principal component and the total variance.
print('Explained variation per principal component: {}'.format(pca_wins.explained_variance_ratio_))
Explained variation per principal component: [0.99615166 0.00278501]
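The ratio itself is simple arithmetic. The sketch below uses made-up component variances (not the wine data) to show how each value in such an output is obtained; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical variances (eigenvalues) of three principal components.
component_variances = np.array([8.0, 1.5, 0.5])

# Each component's share of the total variance.
explained_variance_ratio = component_variances / component_variances.sum()

# Running total: how much variance the first k components keep.
cumulative = np.cumsum(explained_variance_ratio)
print(explained_variance_ratio, cumulative)
```

In the output above, the first component alone accounts for about 99.6% of the variance. A split that lopsided can be a sign that one feature has a much larger numeric scale than the rest, so it may be worth checking whether the features should be standardized before applying PCA.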
Following this, he demonstrated the following models:
Logistic Regression
Random forest
KNN
Naive Bayes
Accuracy was better, with little difference among the models, though he noted that this was after the PCA implementation. Everyone in the room felt that we had completed an excellent roller-coaster ride. He advised us to get hands-on with other Dimensionality Reduction techniques.
Okay, guys! Thanks for your time. I hope I was able to narrate my learning experience of Dimensionality Reduction techniques in the right way here, and I trust it will help you continue the journey of handling complex data sets in machine learning problems. Cheers!
On April 10, 2019, the news broke that scientists with the Event Horizon Telescope project had obtained an image of a black hole for the first time in human history. In this blog, we will learn how the Python programming language and its various tools contributed to this historic achievement.
However, before jumping to the technical aspects of programming, you should know about the EHT (Event Horizon Telescope). It is often described as a computational telescope the size of the Earth. It has an angular resolution of 20 micro-arcseconds, which is enough to read a magazine in New Delhi from a park in Tokyo.
Challenges faced
The EHT was pointed at a supermassive black hole at the center of the M87 galaxy, with a mass about 6.5 billion times that of the Sun. Black holes have been studied for over 100 years, but none had been observed visually before. Having such a powerful instrument brought the following challenges.
Massive Data pre-processing
Rapid atmospheric phase fluctuations
Large recording bandwidth
The EHT produces 350 TB of observations, which makes reducing the volume and complexity of the data extremely difficult.
Below you can see what the EHT data pre-processing pipeline looks like.
Use of Python and its tools
The diagram below illustrates the role of the scientific Python ecosystem in the analysis of data produced by the EHT.
The Python package ‘eht-imaging’ uses NumPy, SciPy, Matplotlib, Astropy, and scikit-image at its core for array data processing, and is responsible for performing simulation and image reconstruction on the data. Below is the dependency chart of the ‘eht-imaging’ package (see the leaf nodes).
Key Python capabilities used
NumPy’s adaptable n-dimensional array helped researchers manipulate large numerical data sets, providing the foundation for the image of the black hole.
Scaling to the vast volume of data, 350 TB per day, involved in EHT imaging.
Handling the complexity of correlating and synchronizing data from telescopes all across the globe.
Speed: the ability to quickly image and manipulate datasets and apply corrections.
Conclusion
We have all witnessed the increasing popularity of Python over the past several years. From e-commerce to social media, there is little Python cannot do if used properly. Seeing Python used in advanced science to make history is yet another achievement for this beautiful language.
Hey GEEKS! Here’s my new blog on Pandas library, an in-depth tool for Data science learners.
It’s never too difficult to transform your dataset into a valuable piece for your project, but it does require research and study to work out which parts of your data are interesting.
Python enables you to get your hands dirty with data in many ways. One of them, and my favorite, is its pandas library. As much as its name excites me, so does its flexibility in accepting all types of data such as JSON, CSV, XLSX, etc., and its numerous fancy features like slice/label indexing and time series functions, which help me in data analysis.
Here I’ve got some really helpful commands that would help you analyze your data better.
First, we need to import the dataset.
EDA gets tougher when handling large datasets. Here are some quick DS hacks to help you out!
select_dtypes()
This function enables you to select columns in a data frame by their data types. With it, you can build many useful subsets of the data based on column dtypes.
Sounds interesting? Let me show you with code
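Here is a small sketch of `select_dtypes()`. The data frame and its column names are hypothetical sample data made up for this example.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "car": ["Honda City", "Maruti Swift"],
    "age": [34, 45],
    "premium": [1200.5, 980.0],
    "insurance": [True, False],
})

numeric_df = df.select_dtypes(include="number")           # int and float columns
bool_df = df.select_dtypes(include="bool")                # boolean columns
other_df = df.select_dtypes(exclude=["number", "bool"])   # everything else
```

Note that boolean columns are not treated as numbers here, so they have to be selected or excluded explicitly.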
melt()
The melt() function is used to transform a data frame from wide to long format. It can take multiple columns as identifiers and leaves you with two non-identifier columns (variable and value) holding the measured variables.
The identifiers (car & age) describe the non-identifiers (whether insurance is taken or not).
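A minimal sketch of `melt()` on hypothetical data follows. The two insurance columns and their values are invented for illustration; `car` and `age` play the role of identifiers.

```python
import pandas as pd

# Hypothetical wide-format data: one column per kind of insurance.
df = pd.DataFrame({
    "car": ["Honda", "Maruti"],
    "age": [34, 45],
    "car_insurance": ["yes", "no"],
    "life_insurance": ["no", "yes"],
})

# Keep car and age as identifiers; stack the other columns into
# a "variable" column (the old column name) and a "value" column.
long_df = df.melt(
    id_vars=["car", "age"],
    value_vars=["car_insurance", "life_insurance"],
    var_name="variable",
    value_name="value",
)
print(long_df)
```

Two rows and two measured columns become four rows, one per (identifier, measured variable) pair.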
columns.tolist()
Another common practice in data munging and analysis is converting the column labels (an Index object) to a list. Since it is easier to modify a list than an Index, this helps us extract only those columns that have a great impact on our target column, or change the order of the columns in a data frame.
The car_loan column becomes the 1st column now.
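The reordering can be sketched like this, on a hypothetical one-row data frame:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"car": ["Honda"], "age": [34], "car_loan": [True]})

cols = df.columns.tolist()                         # plain Python list of labels
cols.insert(0, cols.pop(cols.index("car_loan")))   # move car_loan to the front
df = df[cols]                                      # reindex columns in the new order
```

Indexing the data frame with the reordered list returns the same data with the columns rearranged.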
str.split()
With pandas, we can split a single string column into multiple columns with the str.split() method. With expand=True, it returns a data frame with the separated strings in different columns. You specify the separator in the function call; when no separator is given, the string is split on whitespace.
The column Car splits into two columns, vehicle and car_number.
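A short sketch of that split, using invented values for the `Car` column:

```python
import pandas as pd

# Hypothetical sample data: vehicle name and registration number in one string.
df = pd.DataFrame({"Car": ["Honda KA01AB1234", "Maruti MH12CD5678"]})

# No separator given, so the split is on whitespace.
# n=1 splits only once; expand=True returns one column per piece.
df[["vehicle", "car_number"]] = df["Car"].str.split(n=1, expand=True)
```

Limiting the split with `n=1` keeps the result at exactly two columns even if the second part contains spaces.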
to_datetime()
Handling datetime features is also very important in data science, as models need correctly formatted values for training.
This function takes a date-like argument (for example a string or a column of strings) and converts it into a pandas datetime object.
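For instance, a column of timestamp strings can be converted as below. The values are made up, and the explicit `format` argument is optional but avoids ambiguous day/month parsing.

```python
import pandas as pd

# Hypothetical call log with timestamps stored as text.
df = pd.DataFrame({"call_start": ["2021-03-01 10:15:00", "2021-03-01 11:00:30"]})

df["call_start"] = pd.to_datetime(df["call_start"], format="%Y-%m-%d %H:%M:%S")

# Once converted, the .dt accessor exposes date parts.
hours = df["call_start"].dt.hour
```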
Lambda Functions:
They are nameless (anonymous) functions that can take multiple arguments (bound variables) but contain only a single expression (the body). They are not faster than normal functions, but they are convenient for short, one-off operations.
We apply a lambda function with the help of the apply() function.
In the above example, we have merged the call_start and call_end columns into a single column, call_duration, and then used a lambda function to convert each timedelta value into an integer with the help of the x.seconds attribute.
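A sketch of that call-duration calculation, with hypothetical timestamps, could look like this. The `call_seconds` and `call_seconds_vec` column names are additions for this example.

```python
import pandas as pd

# Hypothetical call log.
df = pd.DataFrame({
    "call_start": pd.to_datetime(["2021-03-01 10:15:00", "2021-03-01 11:00:30"]),
    "call_end": pd.to_datetime(["2021-03-01 10:20:30", "2021-03-01 11:01:00"]),
})

# Subtracting two datetime columns gives a timedelta column.
df["call_duration"] = df["call_end"] - df["call_start"]

# apply() with a lambda: read the seconds attribute of each Timedelta.
df["call_seconds"] = df["call_duration"].apply(lambda x: x.seconds)

# Vectorized alternative without a lambda.
df["call_seconds_vec"] = df["call_duration"].dt.total_seconds().astype(int)
```

Keep in mind that `.seconds` only holds the seconds part of a duration (under one day), so for longer durations `total_seconds()` is the safer choice.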
SPARQL is a powerful language for working with RDF triples. However, SPARQL can also be difficult to work with, so much so that its advanced capabilities, which include aggregating content, building URIs, and similar uses, are not utilized anywhere near as often as they could be. This is the second piece in my exploration of OntoText’s GraphDB database, but many of these techniques can be applied with other triple stores as well.
Tip 1. OPTIONAL and Coalesce()
These two keywords tend to be used together because they both take advantage of the null value. In the simplest case, you can take advantage of coalesce to provide you with a default value. For instance, suppose that you have an article that may have an associated primary image, which is an image that is frequently used for generating thumbnails for social media. If the property exists, use it to retrieve the URL, but if it doesn’t, use a default image URL instead.
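# Turtle
article:_MyFirstArticle
    a class:_Article;
    article:hasTitle "My Article With Image"^^xsd:string;
    article:hasPrimaryImage "path/to/primaryImage.jpg"^^xsd:anyURI;
    .
article:_MySecondArticle
    a class:_Article;
    article:hasTitle "My Article Without Image"^^xsd:string;
    .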
With SPARQL, the OPTIONAL statement will evaluate a triple expression, but if no value is found to match that query then rather than eliminating the triple from the result set, SPARQL will set any unmatched variables to the value null. The coalesce statement can then query the variable, and if the value returned is null, will offer a replacement:
# SPARQL
select ?articleTitle ?articleImageURL where {
    ?article a class:_Article.
    ?article article:hasTitle ?articleTitle.
    optional {
        ?article article:hasPrimaryImage ?imageURL.
    }
    bind(coalesce(?imageURL,"path/to/defaultImage.jpg"^^xsd:anyURI) as ?articleImageURL)
}
This in turn will generate a tuple that looks something like the following:
articleTitle | articleImageURL
My Article With Image | path/to/primaryImage.jpg
My Article Without Image | path/to/defaultImage.jpg
Coalesce takes an unlimited sequence of items and returns the first item that does not return a null value. As such you can use it to create a chain of precedence, with the most desired property appearing first, the second most desired after that and so forth, all the way to a (possible) default value at the end.
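For readers more at home in a general-purpose language, the precedence chain can be sketched in Python. This is only an analogy: `None` stands in for an unbound SPARQL variable, and the `coalesce` function below is a hypothetical helper, not part of any SPARQL library.

```python
def coalesce(*values):
    """Return the first value that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


# Most desired value first, default last.
primary_image = None  # no primary image was found for this article
image_url = coalesce(primary_image, None, "path/to/defaultImage.jpg")
```

Only missing values are skipped: a value such as `0` or an empty string still counts as present.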
You can also use this to create a (somewhat kludgy) sweep of all items out a fixed number of steps:
# SPARQL
select ?s1 ?s2 ?s3 ?s4 ?o ?hops where {
values ?s1 {my:_StartingPoint}
bind(0 as ?hops0)
?s1 ?p1 ?s2.
filter(!bound(?p1))
bind(1 as ?hops1)
optional {
?s2 ?p2 ?s3.
filter(!bound(?p2))
bind(2 as ?hops2)
optional {
?s3 ?p3 ?s4.
filter(!bound(?p3))
bind(3 as ?hops3)
optional {
?s4 ?p4 ?o.
filter(!bound(?p4))
bind(4 as ?hops4)
}
}
}
bind(coalesce(?hops4,?hops3,?hops2,?hops1,?hops0) as ?hops)
}
The bound() function evaluates a variable and returns true() if the variable has been defined and false() otherwise, while the ! operator is the not operator – it flips the value of a Boolean from true to false and vice-versa. Note that if the filter expression evaluates to false(), this will terminate the particular scope. A bind() function will cause a variable to be bound, but so will a triple expression … UNLESS that triple expression is within an OPTIONAL block and nothing is matched.
This approach is flexible but potentially slow and memory intensive, as it will reach out to everything with four hops of the initial node. The filter statements act to limit this: if you have a pattern node-null-null, then this should indicate that the object is also a leaf node, so no more needs to be processed. (This can be generalized, as will be shown below, if you’re in a transitive closure situation).
Tip 2. EXISTS and NOT EXISTS
The EXISTS and NOT EXISTS keywords can be extraordinarily useful, but they can also bog down performance dramatically if used incorrectly. Unlike most operators in SPARQL, these two actually work upon sets of triples, returning true or false values respectively if the triples in question exist. For instance, if none of ?s, ?p or ?o have been established yet, the expression:
# SPARQL
filter(NOT EXISTS {?s ?p ?o})
WILL cause your server to keel over and die. You are, in effect, telling your server to return all triples that don’t currently exist in your system, and while this will usually be caught by your server engine’s exception handler, this is not something you want to test.
However, if you do have at least one of the variables pinned down by the time this expression is called, these two expressions aren’t quite so bad. For starters, you can use EXISTS and NOT EXISTS within bind expressions. For example, suppose that you wanted to identify any orphaned link, where an object in a statement does not have a corresponding link to a subject in another statement:
# SPARQL
select ?o ?isOrphan where {
    ?s ?p ?o.
    filter(!isLiteral(?o))
    bind(!(EXISTS {?o ?p1 ?o2}) as ?isOrphan)
}
In this particular case, only those statements in which the final term is not a literal (meaning those for which the object is either an IRI or a blank node) will be evaluated. The bind statement then looks for the first statement in which the ?o node is a subject in some other statement; the EXISTS keyword returns true if at least one statement is found, while the ! operator inverts the value. Note that EXISTS only needs to find one statement to be true, while NOT EXISTS has to check the whole database to make sure that nothing exists. This is comparable to the any and all keywords in other languages. In general, it is FAR faster to use EXISTS this way than to use NOT EXISTS.
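The same orphan check can be mimicked in Python to show the short-circuit behaviour. The triples, the `ex:` names, and the convention of marking literals with a leading quote are all assumptions made for this sketch; a real triple store works differently internally.

```python
# Hypothetical triples as (subject, predicate, object).
# For this sketch only, literals are strings that start with a double quote.
triples = {
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:b", "ex:knows", "ex:c"),
    ("ex:a", "ex:name", '"Alice"'),
}


def is_literal(term):
    return term.startswith('"')


def is_orphan(node):
    # Like EXISTS {?o ?p1 ?o2}: any() stops at the first triple
    # that has `node` as its subject, then the result is negated.
    return not any(s == node for s, _, _ in triples)


orphans = {o: is_orphan(o) for _, _, o in triples if not is_literal(o)}
```

Here `ex:b` appears as a subject elsewhere, so it is not an orphan, while `ex:c` never does.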
Tip 3. Nested IF statements as Switches (And Why You Don’t Really Need Them)
The SPARQL if() statement is similar to the JavaScript condition ? trueExpression : falseExpression operator, in that it returns a different value based upon whether the condition is true or false. While the expressions are typically literals, there’s nothing stopping you from using object IRIs, which can in turn link to different configurations. For instance, consider the following Turtle:
#Turtle
petConfig:_Dog a class:_PetConfig;
    petConfig:hasPetType petType:_Dog;
    petConfig:hasSound "Woof";
    .
petConfig:_Cat a class:_PetConfig;
    petConfig:hasPetType petType:_Cat;
    petConfig:hasSound "Meow";
    .
petConfig:_Bird a class:_PetConfig;
    petConfig:hasPetType petType:_Bird;
    petConfig:hasSound "Tweet";
    .
pet:_Tiger pet:says "Meow".
pet:_Fido pet:says "Woof".
pet:_Budger pet:says "Tweet".
You can then make use of the if() statement to retrieve the configuration:
# SPARQL
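select ?pet ?petSound ?petType where {
    values (?pet ?petSound) {(pet:_Tiger "Meow")}
    bind(if(?petSound="Woof",petType:_Dog,
         if(?petSound="Meow",petType:_Cat,
         if(?petSound="Tweet",petType:_Bird,
         ?undefined))) as ?petType)
}
where each if() takes exactly three arguments (condition, value if true, value if false), so the alternatives are nested, and the never-bound variable ?undefined leaves ?petType unbound (null) when nothing matches.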
Of course, you can also use a simple bit of SPARQL to infer this without the need for the if statement:
# SPARQL
select ?pet ?petSound ?petType where {
values (?pet ?petSound) {(pet:_Tiger "Meow")}
?petConfig petConfig:hasSound ?petSound.
?petConfig petConfig:hasPetType ?petType.
}
with the results:
?pet | ?petSound | ?petType
pet:_Tiger | "Meow" | petType:_Cat
As a general rule of thumb, the more that you can encode as rules within the graph, the less that you need to rely on if or switch statements and the more robust your logic will be. For instance, while dogs and cats express themselves in different ways most of the time, both of them can growl:
#Turtle
petConfig:_Dog a class:_PetConfig;
petConfig:hasPetType petType:_Dog;
petConfig:hasSound "Woof","Growl","Whine";
.
petConfig:_Cat a class:_PetConfig;
petConfig:hasPetType petType:_Cat;
petConfig:hasSound "Meow","Growl","Purr";
.
?pet | ?petSound | ?petType
pet:_Tiger | "Growl" | petType:_Cat
pet:_Fido | "Growl" | petType:_Dog
In this case, the switch statement would break, as Growl is not in the options, but the direct use of SPARQL works just fine.
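The same contrast can be sketched outside the graph. In the Python below, the first function hard-codes the branches (like the nested if), while the second looks the sound up in data that mirrors the petConfig:hasSound statements. The function and dictionary names are hypothetical, chosen only for this illustration.

```python
# Hard-coded branches, analogous to the nested if(): unknown sounds fall through.
def pet_type_by_branch(sound):
    if sound == "Woof":
        return "petType:_Dog"
    if sound == "Meow":
        return "petType:_Cat"
    if sound == "Tweet":
        return "petType:_Bird"
    return None

# Data-driven lookup, analogous to matching petConfig:hasSound in the graph.
PET_CONFIG = {
    "petType:_Dog": {"Woof", "Growl", "Whine"},
    "petType:_Cat": {"Meow", "Growl", "Purr"},
    "petType:_Bird": {"Tweet"},
}

def pet_types_by_data(sound):
    """Return every pet type whose configuration lists `sound`."""
    return sorted(t for t, sounds in PET_CONFIG.items() if sound in sounds)
```

Adding a new sound means editing the data, not the function, which is the point the Turtle version makes.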
Tip 4. Unspooling Sequences
Sequences, items that are in a specific order, are fairly easy to create with SPARQL but surprisingly there are few explanations for how to build them . . . or query them. Creating a sequence in Turtle involves putting a list of items in between parentheses as part of an object. For instance, suppose that you have a book that consists of a prologue, five numbered chapters, and an epilogue. This would be expressed in Turtle as:
# Turtle
book:_StormCrow book:hasChapter (
chapter:_Prologue
chapter:_Chapter1
chapter:_Chapter2
chapter:_Chapter3
chapter:_Chapter4
chapter:_Chapter5
chapter:_Epilogue
).
Note that there are no commas between each chapter.
Now, there is a little magic that Turtle parsers do in the background when parsing such sequences. They actually convert the above structure into a string with blank nodes, using the three URIs rdf:first, rdf:rest and rdf:nil. Internally, the above statement looks considerably different:
# Turtle
book:_StormCrow book:hasChapter _:b1.
_:b1 rdf:first chapter:_Prologue.
_:b1 rdf:rest _:b2.
_:b2 rdf:first chapter:_Chapter1.
_:b2 rdf:rest _:b3.
_:b3 rdf:first chapter:_Chapter2.
_:b3 rdf:rest _:b4.
_:b4 rdf:first chapter:_Chapter3.
_:b4 rdf:rest _:b5.
_:b5 rdf:first chapter:_Chapter4.
_:b5 rdf:rest _:b6.
_:b6 rdf:first chapter:_Chapter5.
_:b6 rdf:rest _:b7.
_:b7 rdf:first chapter:_Epilogue.
_:b7 rdf:rest rdf:nil.
While this looks daunting, programmers might recognize this as a very basic linked list, where rdf:first points to an item in the list and rdf:rest points to the next position in the list. The first blank node, _:b1, is then a pointer to the linked list itself. The rdf:nil is a system-defined URI that translates into a null value, just like the empty sequence (). In fact, the empty sequence in SPARQL is the same thing as a linked list with no items and a terminating rdf:nil.
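For readers who think in code, here is a minimal Python sketch of the same linked-list structure. It is not how any particular Turtle parser works internally; build_list and unspool are hypothetical helper names, and triples are modeled as plain tuples.

```python
RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"

def build_list(items):
    """Return (head, triples) for an RDF-style collection holding `items`."""
    if not items:
        return RDF_NIL, []          # the empty sequence is just rdf:nil
    triples = []
    for i, item in enumerate(items, start=1):
        node = f"_:b{i}"
        nxt = f"_:b{i + 1}" if i < len(items) else RDF_NIL
        triples.append((node, RDF_FIRST, item))   # the item at this position
        triples.append((node, RDF_REST, nxt))     # pointer to the next position
    return "_:b1", triples

def unspool(head, triples):
    """Walk rdf:rest links from `head`, collecting each rdf:first item in order."""
    first = {s: o for s, p, o in triples if p == RDF_FIRST}
    rest = {s: o for s, p, o in triples if p == RDF_REST}
    items, node = [], head
    while node != RDF_NIL:
        items.append(first[node])
        node = rest[node]
    return items

chapters = ["chapter:_Prologue", "chapter:_Chapter1", "chapter:_Epilogue"]
head, triples = build_list(chapters)
```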
Since you don’t know how long the list is likely to be (it may have one item, or thousands) building a query to retrieve the chapters in their original order would seem to be hopeless. Fortunately, this is where transitive closure and property paths come into play. Assume that each chapter has a property called chapter:hasTitle (a subproperty of rdfs:label). Then to retrieve the names of the chapters in order for a given book, you’d do the following:
# SPARQL
select ?chapterTitle where {
values ?book {book:_StormCrow}
?book book:hasChapter/rdf:rest*/rdf:first ?chapter.
?chapter chapter:hasTitle ?chapterTitle.
}
That’s it. The output, then, is what you’d expect for a sequence of chapters:
pointsTo
chapter:_Prologue
chapter:_Chapter1
chapter:_Chapter2
chapter:_Chapter3
rdf:nil
The property path rdf:rest*/rdf:first requires a bit of parsing to understand what is happening here. The * in rdf:rest* indicates that, from the starting node, the rdf:rest path is traversed zero times, one time, two times, and so forth until it finally hits rdf:nil. Traversing zero times may seem a bit counterintuitive, but it means simply that you treat the starting node as an item in the traversal path. At the end of each path, the rdf:first link is then traversed to get to the item in question (here, each chapter in turn). You can see this broken down in the following table:
path | pointsTo
rdf:first | chapter:_Prologue
rdf:rest/rdf:first | chapter:_Chapter1
rdf:rest/rdf:rest/rdf:first | chapter:_Chapter2
rdf:rest/rdf:rest/rdf:rest/rdf:first | chapter:_Chapter3
rdf:rest/rdf:rest/rdf:rest/rdf:rest | rdf:nil
If you don’t want to include the initial subject in the sequence, then use rdf:rest+/rdf:first where the * and + have the same meaning as you may be familiar with in regular expressions, zero or more and one or more respectively.
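The difference between the two operators can be sketched in a few lines of Python. This is only an illustration of the zero-or-more versus one-or-more idea, using hypothetical dictionaries in place of rdf:rest and rdf:first statements; it is not how a SPARQL engine evaluates paths.

```python
def follow_rest(head, rest, min_hops):
    """Yield each list node reachable from `head` via at least `min_hops` rdf:rest links."""
    node, hops = head, 0
    while node != "rdf:nil":      # rdf:nil has no rdf:first, so it contributes nothing
        if hops >= min_hops:
            yield node
        node = rest[node]
        hops += 1

# Hypothetical three-item list: node -> next node, and node -> item.
rest = {"_:b1": "_:b2", "_:b2": "_:b3", "_:b3": "rdf:nil"}
first = {"_:b1": "Prologue", "_:b2": "Chapter 1", "_:b3": "Epilogue"}

star = [first[n] for n in follow_rest("_:b1", rest, 0)]   # like rdf:rest*/rdf:first
plus = [first[n] for n in follow_rest("_:b1", rest, 1)]   # like rdf:rest+/rdf:first
```

Here star holds all three items while plus skips the first one. The sketch walks the list in order; a SPARQL engine is not formally required to return rows in any particular order, so it is worth checking the ordering your own store produces.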
This ability to traverse multiple repeating paths is one example of transitive closure. Transitive closures play a major role in inferential analysis and can easily take up a whole article in its own right, but for now, it’s just worth remembering the ur example – unspooling sequences.
The ability to create sequences in TURTLE (and use them in SPARQL) makes a lot of things that would otherwise be difficult if not impossible to do surprisingly easy.
As a simple example, suppose that you wanted to find where a given chapter is in a library of books. The following SPARQL illustrates this idea:
# SPARQL
select ?book where {
values ?searchChapter {chapter:_Prologue}
?book a class:_book.
?book book:hasChapter/rdf:rest*/rdf:first ?chapter.
filter(?chapter=?searchChapter)
}
This is important for a number of reasons. In publishing in particular there’s a tendency to want to deconstruct larger works (such as books) into smaller ones (chapters), in such a way that the same chapter can be utilized by multiple books. The sequence of these chapters may vary considerably from one work to the next, but if the sequence is bound to the book and the chapters are then referenced, there’s no need for the chapters to have knowledge about their neighbors. This same design pattern occurs throughout data modeling, and this ability to maintain sequences of multiply utilized components makes distributed programming considerably easier.
Tip 5. Utilizing Aggregates
I work a lot with Microsoft Excel documents when developing semantic solutions, and since Excel will automatically open up CSV files, using SPARQL to generate spreadsheets SHOULD be a no brainer.
However, there are times where things can get a bit more complex. For instance, suppose that I have a list of books and chapters as above, and would like each book to list its chapters in a single cell. Ordinarily, if you just use the ?chapterTitle variable as given above, you’ll get one line for each chapter, which is not what’s wanted here:
# SPARQL
select ?bookTitle ?chapterTitle where {
?book a class:_book.
?book book:hasChapter/rdf:rest*/rdf:first ?chapter.
?chapter chapter:hasTitle ?chapterTitle.
?book book:hasTitle ?bookTitle.
}
This is where aggregates come into play, and where you can tear your hair out if you don’t know the Ninja Secrets. To make this happen, you need to use subqueries. A subquery is a query within another query that calculates output that can then be pushed up to the calling query, and it usually involves working with aggregates – query functions that combine several items together in some way.
One of the big aggregate workhorses (and one that is surprisingly poorly documented) is the group_concat() function. This function will take a set of URIs, literals or both and combine them into a single string. This is roughly analogous to the JavaScript join() function or the XQuery string-join() function. So, to create a delimited list of chapter names (here with one title per line), you’d end up with a SPARQL script that looks something like this:
# SPARQL
select ?bookTitle ?chapterList ?chapterCount where {
?book a class:_book.
?book book:hasTitle ?bookTitle.
{{
select ?book
(group_concat(?chapterTitle;separator="\n") as ?chapterList)
(count(?chapterTitle) as ?chapterCount) where {
?book book:hasChapter/rdf:rest*/rdf:first ?chapter.
?chapter chapter:hasTitle ?chapterTitle.
} group by ?book
}}
}
The magic happens in the inner select, but it requires that the SELECT statement includes any variable that is passed into it (here ?book) and that the same variable is echoed in the GROUP BY statement after the body of the subquery.
Once these variables are “locked down”, then the aggregate functions should work as expected. The first argument of the group_concat function is the variable to be made into a list. After this, you can have multiple optional parameters that control the output of the list, with the separator being the one most commonly used. Other parameters can include ROW_LIMIT, PRE (for Prefix string), SUFFIX, MAX_LENGTH (for string output) and the Booleans VALUE_SERIALIZE and DELIMIT_BLANKS, each separated by a semi-colon. Implementations may vary depending upon vendor, so these should be tested.
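As a rough mental model of what the grouped subquery computes, the Python sketch below groups (book, title) rows by book, joins the titles with a separator, and counts them. The rows and the function name are illustrative assumptions, and the vendor-specific parameters mentioned above are not modeled.

```python
from collections import defaultdict

# One row per (book, chapter title), as the ungrouped query would return them.
rows = [
    ("book:_StormCrow", "Prologue"),
    ("book:_StormCrow", "Chapter 1"),
    ("book:_RavenSong", "Opening"),
]

def group_concat(rows, separator="\n"):
    """Group rows by book; return {book: (joined titles, title count)}."""
    groups = defaultdict(list)
    for book, title in rows:
        groups[book].append(title)          # same role as "group by ?book"
    return {
        book: (separator.join(titles), len(titles))
        for book, titles in groups.items()
    }

report = group_concat(rows, separator=", ")
```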
Note that this combination can give a lot of latitude. For instance, the expression:
will generate an HTML list sequence, and similar structures can be used to generate tables and other constructs. Similarly, it should be possible to generate JSON content from SPARQL through the intelligent use of aggregates, though that’s grist for another article.
The above script also illustrates how a count function has piggy-backed on the same subquery, in this case using the COUNT() function.
It’s worth mentioning the spif:buildString() function (part of the SPIN Function library that is supported by a number of vendors), which accepts a string template and a comma-separated list of parameters. The function then replaces each instance of “{?1}”, “{?2}”, etc. with the parameter at that position (the template string being the zeroth value). So a very simple report from above may be written as
# SPARQL
bind(spif:buildString("Book ‘{?1}’ has {?2} chapters.",?bookTitle,?chapterCount) as ?report)
which will create the following ?report string:
Book ‘Storm Crow’ has 7 chapters.
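The positional substitution described above can be imitated in a few lines of Python. This is only a sketch of the idea, assuming the {?1}, {?2} placeholder form; build_string is a hypothetical stand-in and makes no claim about how the SPIN library implements the function.

```python
import re

def build_string(template, *args):
    """Replace {?1}, {?2}, ... in `template` with the positional argument at that index."""
    def substitute(match):
        position = int(match.group(1))       # placeholders are 1-based
        return str(args[position - 1])
    return re.sub(r"\{\?(\d+)\}", substitute, template)

report = build_string("Book '{?1}' has {?2} chapters.", "Storm Crow", 7)
```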
This templating capability can be very useful, as templates can themselves be stored as resource strings, with the following Turtle:
#Turtle
reportTemplate:_BookReport
a class:_ReportTemplate;
reportTemplate:hasTemplateString "Book ‘{?1}’ has {?2} chapters."^^xsd:string;
.
This can then be referenced elsewhere:
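#SPARQL
select ?report where {
?book a class:_book.
?book book:hasTitle ?bookTitle.
reportTemplate:_BookReport reportTemplate:hasTemplateString ?reportStr.
{{
select ?book
(group_concat(?chapterTitle;separator="\n") as ?chapterList)
(count(?chapterTitle) as ?chapterCount) where {
?book book:hasChapter/rdf:rest*/rdf:first ?chapter.
?chapter chapter:hasTitle ?chapterTitle.
} group by ?book
}}
bind(spif:buildString(?reportStr,?bookTitle,?chapterCount) as ?report).
}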
With output looking something like the following:
report
Book ‘Storm Crow’ has 7 chapters.
Book ‘The Scent of Rain’ has 25 chapters.
Book ‘Raven Song’ has 18 chapters.
This can be extended to HTML-generated content as well, illustrating how SPARQL can be used to drive a basic content management system.
Tip 6. SPARQL Analytics and Extensions
There is a tendency among programmers new to RDF to want to treat a triple store the same way that they would a SQL database – use it to retrieve content into a form like JSON and then do the processing elsewhere. However, SPARQL is versatile enough that it can be used to do basic (and not so basic) analytics all on its own.
For instance, consider the use case where you have items in a financial transaction, where the items may be subject to one of three different types of taxes, based upon specific item details. This can be modeled as follows:
This is a fairly common real-world scenario, and while the logic for determining a total price in a traditional language is not complex, it is still not trivial. In SPARQL, you can again make use of aggregate functions to do things like get the total cost:
#SPARQL
select ?order ?totalCost where {
values ?order {order:_ord123}
{{
select ?order (sum(?itemTotalCost) as ?totalCost) where {
?order order:hasItems ?itemList.
?itemList rdf:rest*/rdf:first ?item.
?item item:hasPrice ?itemCost.
?item item:hasTaxType ?taxType.
?taxType taxType:hasRate ?taxRate.
bind(?itemCost * (1 + ?taxRate) as ?itemTotalCost)
}
group by ?order
}}
}
While this is a simple example, weighted cost sum equations tend to make up the bulk of all analytics operations. Extending this to incorporate other factors such as discounts is also easy to do in situ, with the following additions to the model:
filter(now() >= ?discountStartDate && ?discountEndDate >= now())
}
bind(coalesce(?DiscountRate,0) as ?discountRate)
bind(?itemCost*(1 - ?discountRate)*(1 + ?taxRate) as ?itemTotalCost)
}
}}
}
In this particular case, taxes are required, but discounts are optional. Also note that the discount price is only applicable around Memorial Day weekend, with the filter set up in such a way that ?DiscountRate would be null at any other time. The conditional logic required to support this externally would be getting pretty hairy at this point, but the SPARQL rules extend it with aplomb.
There is a lesson worth extracting here: use the data model to store contextual information, rather than relying upon outside algorithms. It’s straightforward to add another discount period (a sale, in essence) and with not much more work you can even have multiple overlapping sales apply on the same item.
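To make the arithmetic explicit, here is a small Python sketch of the same calculation: tax is always applied, while the discount defaults to zero unless the date falls inside the sale window (the role coalesce plays in the query). The prices, rates, and dates are invented for illustration, and the function names are hypothetical.

```python
from datetime import date
from decimal import Decimal

def item_total(price, tax_rate, discount, today):
    """price * (1 - discount rate) * (1 + tax rate); discount is optional."""
    rate = Decimal("0")                      # default when no discount applies
    if discount and discount["start"] <= today <= discount["end"]:
        rate = discount["rate"]
    return price * (1 - rate) * (1 + tax_rate)

def order_total(items, today):
    """Sum the per-item totals, like sum(?itemTotalCost) grouped by order."""
    return sum(
        item_total(i["price"], i["tax_rate"], i.get("discount"), today)
        for i in items
    )

# Illustrative order: one discounted item, one without any discount.
items = [
    {"price": Decimal("100"), "tax_rate": Decimal("0.10"),
     "discount": {"rate": Decimal("0.20"),
                  "start": date(2024, 5, 24), "end": date(2024, 5, 28)}},
    {"price": Decimal("50"), "tax_rate": Decimal("0.05")},
]

during_sale = order_total(items, date(2024, 5, 26))
after_sale = order_total(items, date(2024, 7, 1))
```

Supporting another sale means adding another discount record to the data; the calculation itself does not change.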
Summary
The secret to all of this: these aren’t really Ninja secrets. SPARQL, while not perfect, is nonetheless a powerful and expressive language that can work well when dealing with a number of different use cases. By introducing sequences, optional statements, coalesce, templates, aggregates and existential statements, a good SPARQL developer can dramatically reduce the amount of code that needs to be written outside of the database. Moreover, by taking advantage of the fact that in RDF everything can be a pointer, complex business rules can be applied within the database itself without a significant overhead (which is not true of SQL stored procedures).
So, get out the throwing stars and stealthy foot gloves: It’s SPARQL time!
The first part of this list was published here. These are articles that I wrote in the last few years. The whole series will feature articles related to the following aspects of machine learning:
Mathematics, simulations, benchmarking algorithms based on synthetic data (in short, experimental data science)
Opinions, for instance about the value of a PhD in our field, or the use of some techniques
Methods, principles, rules of thumb, recipes, tricks
My articles are always written in simple English and accessible to professionals with typically one year of calculus or statistical training, at the undergraduate level. They are geared towards people who use data but are interested in gaining more practical analytical experience. Managers and decision makers are part of my intended audience. The style is compact, geared towards people who do not have a lot of free time.
Despite these restrictions, state-of-the-art, off-the-beaten-path results as well as machine learning trade secrets and research material are frequently shared. References to more advanced literature (from myself and other authors) are provided for those who want to dig deeper into the topics discussed.
1. Machine Learning Tricks, Recipes and Statistical Models
These articles focus on techniques that have wide applications or that are otherwise fundamental or seminal in nature.
Statistics: New Foundations, Toolbox, and Machine Learning Recipes
Available here. In about 300 pages and 28 chapters it covers many new topics, offering a fresh perspective on the subject, including rules of thumb and recipes that are easy to automate or integrate in black-box systems, as well as new model-free, data-driven foundations to statistical science and predictive analytics. The approach focuses on robust techniques; it is bottom-up (from applications to theory), in contrast to the traditional top-down approach.
The material is accessible to practitioners with a one-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications with numerous illustrations, is aimed at practitioners, researchers, and executives in various quantitative fields.
Applied Stochastic Processes
Available here. Full title: Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems (104 pages, 16 chapters.) This book is intended for professionals in data science, computer science, operations research, statistics, machine learning, big data, and mathematics. In 100 pages, it covers many new topics, offering a fresh perspective on the subject.
It is accessible to practitioners with a two-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications (Blockchain, quantum algorithms, HPC, random number generation, cryptography, Fintech, web crawling, statistical testing) with numerous illustrations, is aimed at practitioners, researchers and executives in various quantitative fields.
About the author: Vincent Granville is a data science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent is also self-publisher at DataShaping.com, and founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target). He recently opened Paris Restaurant, in Anacortes. You can access Vincent’s articles and books, here.
If you wanted to recruit for “data science” talent at a university, where would you go? Should you go to the College of Computing? Would it be in the College of Business? Is it in the Department of Mathematics? Statistics? Is there even a Department of Data Science?
There is more variation in the housing of data science than any other academic discipline on a university campus. Why the variation? And why should you care?
The answer to the first question – Why the variation? – may not be straightforward.
As in any organization, not all academic programs are a function of long-term, well-considered strategic planning – many analytics programs evolved at the intersection of resources, needs, and opportunity (and some noisy passionate faculty). As universities began to formally introduce data science programs around 2006, there was little consistency regarding where this new discipline should be housed. Given the “academic ancestry” of analytics and data science it is not surprising that there is variation of placement of programs across the academic landscape:
Exacerbating this, we do not yet have a universal consensus as to what set of competencies should be common to a data science curriculum – again largely due to its transdisciplinary foundations. The fields of computer science, mathematics, statistics, and almost every applied field (business, health care, engineering) have professional organizations and long-standing models for what constitutes competency in those fields. Data science has no standardization, no accreditation, and no certifying body. As a result, the “data science” curriculum may look very different at different universities – all issues that have contributed to the misalignment of expectations for both students and for hiring managers.
The second question (why should you care?) might be more relevant –
Generally, universities have approached the evolution of data science from one of two perspectives – as a discipline “spoke” (or series of electives) or as a discipline “hub” (as a major) as in the graphic above.
University programs that are “hubs” – reflecting the model above on the left – have likely been established as a “major” field of study. These programs are likely to be housed in a more computational college (e.g., Computing, Science, Statistics) or in a research unit (like a Center or Institute) and will focus on the “science of the data”. They tend to be less focused on the nuances of any individual area of application. Hub programs will (generally) allow (encourage) their students to take a series of electives in some application domain (i.e., students coming out of a hub program may go into Fintech, but they may also go into Healthcare – their major is “data”). Alternatively, programs that are “spokes” – reflecting the model above on the right – are more likely to be called “analytics” and are more frequently housed in colleges of business, medicine, and the humanities. Programs that are “spokes” are (generally) less focused on the computational requirements and are more aligned with applied domain-specific analytics. Students coming out of these programs will have stronger domain expertise and will better understand how to integrate results into the original business problem but may lack deep computational skills. Neither is “wrong” or “better” – the philosophical approaches are different.
Understanding more about where an analytics program is housed and whether analytics is treated as a “hub” or a “spoke” should inform and improve analytics professionals’ collaborative experiences with universities.
The book “Closing the Analytics Talent Gap” is available through Amazon and directly from the publisher CRC Press.
Due to the response to the COVID-19 pandemic, banks and financial services companies have faced unexpected difficulties over the past year. Through accelerated adaptation initiatives and a greater emphasis on digital presence as consumers go online to access essential services, the industry is rethinking the future of digital banking. During these troubling times, business leaders are increasingly realizing the enormous potential of digitization and the need to integrate it into onboarding to maintain a foothold in the sector.
Digitization is a Necessity Not an Option
Before COVID-19, the financial industry still had to update its digital experience by incorporating consumer preferences into business strategies, despite the demand for digital experiences among customers of all ages. In recent months, there has been a dramatic acceleration in the digital race. Digital onboarding and real-time communication tools are no longer a ‘nice-to-have’ or a ‘Plan B’; they are a mandatory requirement. Companies need to look at video conferencing, digital document management, digital signature, video identification and biometrics, and cloud services, and use smart automation to reduce operational burdens and ensure security compliance.
Faster Implementation is Key
The response to COVID-19 has put the spotlight on two major categories of disadvantages in traditional onboarding, namely slow processes and poor customer experience. Traditional processes are generally slow, repetitive, and complex. Slow progress can lead to a 40% abandonment rate in onboarding, and nearly 7 out of 10 millennials demand a seamlessly integrated experience for digital services across all channels. There is a demand for faster and easier onboarding processes.
Aggregate and Integrate
The need to address the disaggregation of client data, processes, and stakeholders was a core problem in financial services at the beginning of 2020. With repeated cycles of government-imposed social distancing, it is vital to support a remote onboarding model by implementing intelligent orchestration within the organization’s IT structure, which leads to greater organizational agility and resilience. This strategy enables businesses to incorporate third-party technologies such as biometrics, cloud, real-time messaging services, artificial intelligence (AI), and next-best-action (NBA) seamlessly while streamlining interactions with digital consumers. 67% of customers prefer self-service instead of speaking to a company representative, and AI can help integrate this. Businesses are using smart analytics and AI to figure out what really matters to their customers. This enables them to give consumers what they care about on the platforms of their choosing.
eKYC for Secure Onboarding
Digital onboarding provides companies with the opportunity to react to changes at the pace of information. Organizations using these solutions can, within a few minutes, sign up their clients, open their bank accounts, complete a loan procedure, and complete their KYC, a process that had previously taken days. This is exactly why organizations, in order to prepare for the future, must act now. Organizations will have to work alongside a recovery strategy as business resumes, where both downtime and inefficiency of employees will have an undesirable impact. Early eKYC adopters are now writing their success stories and redefining the baseline of consumer needs. We also live in a world where consumers are busier than ever, which means that it is important to eliminate friction across the entire customer journey, across both digital and physical channels, and allow hybrid experiences. Customers are pleased not to deal with time-consuming, non-essential documents and to be able to concentrate on more important and meaningful data.
IDcentral’s suite of onboarding services provides an end-to-end eKYC solution that helps onboard customers with ease without compromising security. Additionally, IDcentral helps reduce customer drop-out rates by moving beyond the lengthy processes and inefficiencies of current solutions. IDcentral’s highly accurate OCR is able to extract exact information even if the document uploaded is blurred or skewed by 360 degrees, which improves customer experience by eliminating repeated attempts while uploading ID documents. IDcentral’s liveness solution includes 3D Mapping, Skin Texture Analysis, and Micro Movement Detection capabilities, which eliminate the slightest possibility of morphing and spoof attacks. Companies can pick and choose the services they need as per their vertical and risk appetite and access these services free for the first six months, which can significantly reduce costs.
It is a completely self-serve platform that you can integrate into your system within a few hours and start using it.
Better Communication for Better Customer Satisfaction
For businesses, data is becoming the most coveted commodity, which is why integration becomes all the more important. To ensure that meaningful and appropriate information remains available, connect the various recording systems and keep the data synchronized and consistent. Employees can benefit from the availability of data to boost communication between teams, work effortlessly together and create more value. Innovation becomes difficult when it is unclear how each variable relates to the others. Businesses have to bear in mind that the more connected everything is, the better the long-term result.
Expand Existing Connections
Instead of an immense opportunity to establish relationships, client onboarding is often perceived as a repetitive process. For the customer, the lack of concrete details on their onboarding status becomes frustrating. In addition, the absence of a customer’s centralized view can lead to possible fallouts. Make onboarding interactions, even in more complicated situations, as simple as possible. Onboard an entire family in a single process. Include multiple goods and facilities that do not share the same processes (like checking the eligibility of a more complicated service in detail for an individual person). Therefore, to attract the customers of today and convert them to loyal clients, a strategy centered on smoother digital onboarding is more than necessary.
As more and more data resides in online repositories, data backup and protection have taken on critical importance – not just for huge corporations but for organizations of all sizes. In fact, these capabilities may even now be determining an organization’s future. Veeam commissioned an independent survey of 1,550 larger companies (with over 1,000 users) across 18 countries to examine how data backup and restoration are currently being handled.
The survey reveals that as IT practices improve in the area of data protection and backup, a continually changing digital transformation is taking place. Surveyed organizations use a diverse mix of physical servers, virtual machines and cloud-hosted virtual machines (VMs), while around 10% of on-premises data systems will shift to the cloud over the next two years.
The surveyed companies expressed interest in ensuring that the cloud and data are more available to help improve customer experience and the impact of their brands. “However, the research infers that by modernizing data protection with easy-to-use and flexible solutions,” the survey states, “businesses can greatly increase the protection and usability of their data while also freeing a lot of resources to focus further on their IT modernization and management efforts.”
Backup challenges
Backing up and restoring data is a major concern because the data provided by IT is the “heart and soul” of modern companies. Downtime is another issue, with 95% of the surveyed organizations experiencing unexpected outages; at least 10% of servers have at least one outage of two hours on average, once per year. The researchers note, therefore, how important it is to modernize data protection for those inevitable outages. Doing so can help better manage operations, impact customer service, reduce costs and lessen employee task time.
Any time there is a change, and especially a large change like modernizing current systems, there will be challenges. Some include a lack of IT skills in a company’s workforce, a dependency on legacy systems and a lack of staff and/or budget that ultimately prevents them from engaging in this digital transformation.
The surveyed companies indicated that they want to be able to move workloads from on-premises to the cloud, and they want cloud-based disaster recovery. Flexibility of solutions, the researchers conclude, is a big factor in the adoption of new systems and technologies. Data protection, therefore, must be simple, have no delays, and present an immediate return on investment (ROI). It must also be flexible enough to allow for data access from anywhere and at any time. It must continue to be reliable, as well, even as the IT environment evolves.
When planning to improve their current backup systems, companies are looking for reductions in costs and complexity, improved recovery time and reliability. Modernizing your backup into cloud data management can cut the cost of data backup and protection by 50%. That can lead to a 55% increase, says Veeam, in efficiency as well.
Most current mission-critical systems are still tied to legacy solutions, most often located on-site. It’s implausible, then, to expect that organizations will jump directly to a fully modernized backup system. But by starting with a hybrid solution, where data is stored both on-premises and in the cloud, managed by a unified toolset, companies are seeing a 49% savings on costs, according to the survey.
Compliance and security
Another factor is that the cost of compliance is rising as governmental regulation continues to increase across the globe. Relying on ad-hoc or legacy systems to protect and audit data, as companies tend to do now, can result in what the researchers call “isolated pockets of visibility.” And these “pockets” can be targeted by cyber-attackers.
A primary challenge for organizations today is to make sure data is reliably backed up and instantly recovered when needed. As organizations continue to create more and more data, so must data protection and backup rise to the challenge. Modern systems must be more intelligent, anticipate user needs and meet user demands.
Building a new approach
Change is not without its hurdles, but the research demonstrates that organizations cannot afford to ignore the changing IT landscape. Data protection and backup have become mission-critical issues as data volumes continue to explode and data gets distributed across the cloud. Simple, flexible solutions are a must, and they must also be affordable. A robust data management system helps organizations remain compliant and gain greater visibility to defend against attack. As IT leaders consider a new approach to data protection and backup, they should take into account the significant benefits of automated, cloud-friendly solutions.