Data science is trending everywhere, and everyone is talking about it, whether as an industry practice or as a career. Over time, it has taken on an almost superhero status. We also frequently hear that data science is one of the most lucrative career options. Have you ever wondered why companies offer data scientists such high salaries?
The answer is simple: we place more value on what is scarce, and the same is true of data scientists. Salaries are skyrocketing because there is a shortage of data scientists in the industry. According to a McKinsey report, the United States is facing a shortage of approximately 140,000 data scientists.
Let’s understand why there is a shortage of data scientists and what companies look for in them.
WHY IS THERE A SHORTAGE?
The major reason for the shortage of data scientists in the industry is a lack of skills. A person is valued not for grades and degrees but for skills, and data scientists are highly skilled people expected to possess both technical and non-technical skills.
But companies are often unable to find the required data science skills in aspirants, which is why there is such a large shortage of data scientists in the industry.
The other major problem beginners face is that companies demand a master’s degree along with some years of experience. Being beginners, they have no experience in data science, yet companies require experience for the job. That forms a deadlock.
Let’s have a look at the skills that companies are looking for in a data science aspirant. The skills are broadly divided into two categories, i.e. technical skills and non-technical skills.
Technical skills:
On the technical side, a data scientist must have a good command of mathematics, statistics, probability, programming, Tableau, and big data technologies. Here is the list of technical skills a data scientist must have:
Descriptive statistics
Inferential statistics
Linear algebra
Calculus
Discrete math
Optimization theory
Python
R
Database query language
Tableau
Big data technologies
Non-technical skills:
Along with technical skills, non-technical skills are also important for a data scientist. Here are the non-technical skills:
Data intuition
Data inquisitiveness
Business expertise
Communication skills
Teamwork
CONCLUSION
These are the skills a data scientist must possess, and the lack of these skills is the foremost reason why there is a shortage of data scientists in the industry. Work on the skills mentioned above to drive your career toward data science!
Have you heard IDC’s latest predictions on DevOps? According to this study, the DevOps market is expected to grow from $2.9 billion to $8 billion by 2022! Experts say it is set to dominate the new decade by offering even greater benefits to developers and users. Organizations are more likely than ever to adopt DevOps at all levels, as it is efficient and quick to implement. It is already starting to reshape the software world, and 2021 will undoubtedly be bigger and better than previous years.
So what are the critical DevOps trends for 2021? We think these eight DevOps trends will steal the show now and in the near future.
| The rise of AI
The time when manual testing is no longer the order of the day is not far off. As AI combines with DevOps automation, the change in the way processes are performed is already underway.
AI uses logs and activity reports to predict the performance of code. When you harness the power of AI, automating acceptance testing, implementation testing, and functional testing becomes easier for organizations. As a result, the software release process becomes flawless, more efficient, and faster, as continuous delivery is assured.
According to the latest expert predictions, DevOps ideas will increasingly appear in workflows as the number of AI-powered applications grows. DevOps will emerge as the preferred option for managing the testing and maintenance of the many models in the production chain.
| Using serverless computing in DevOps
DevOps is ready to reach a new level of excellence with serverless computing. These applications rely on third-party services, called BaaS (Backend as a Service), or on custom code running in temporary containers, called FaaS (Function as a Service).
Serverless means that the company or individual running the system does not need to rent or buy virtual machines to run the backend code.
The main advantage of serverless computing is that it gives developers the freedom to focus solely on the development aspects of the application without having to worry about anything else. It does not require upgrading existing servers, deployment is quick and easy, and it takes less time and can cost less.
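To make the FaaS idea concrete, here is a minimal sketch of a serverless function in Python, written in the (event, context) handler shape popularized by AWS Lambda; the event fields and the greeting logic are purely illustrative, not a specific provider's required API.

```python
import json

def handler(event, context):
    """Minimal FaaS-style entry point: the platform invokes this function
    on demand, so no server has to be provisioned or maintained by the team."""
    # 'event' carries the request payload; the keys below are illustrative only.
    name = event.get("name", "world")
    body = {"message": f"Hello, {name}!"}
    # Return an HTTP-style response an API gateway could relay to the caller.
    return {"statusCode": 200, "body": json.dumps(body)}

# Local smoke test (in production the cloud provider calls handler for you).
if __name__ == "__main__":
    print(handler({"name": "DevOps"}, None))
```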
| Controlling security breaches with DevSecOps
A majority of DevOps companies are turning to DevSecOps because the number of incidents related to security breaches has increased recently. IT companies consider DevSecOps as one of the many DevOps best practices.
Think of DevSecOps as an approach to application security that builds security into every aspect of the code from the beginning.
Building security measures into the development process leads to greater collaboration, and it makes the process much more efficient, error-free, and effective. Expect more adoption of DevSecOps in the years to come.
| Save time through greater automation
Detecting errors quickly, enhancing security, and saving time: automation offers all this and more. It eliminates the need for manual work throughout the software development cycle. It is no wonder that automation played an essential role in 2021.
There are six Cs of DevOps:
Continuous optimization and feedback
Continuous monitoring
Continuous deployment and release
Continuous testing
Collaborative development
Continuous business planning
Automation must be integrated with each of these components in the coming years to make DevOps more efficient.
| The container orchestration platform of choice is Kubernetes
Kubernetes has been a massive success in 2019, and this reign will continue in 2021 and beyond. Several valuable features and improved experiences have led developers to rely on Kubernetes more than others.
Kubernetes helps businesses with scalability, portability, and automation, which is why it is considered one of the best container orchestration technologies. New and better Kubernetes features in the coming years will support reliable and efficient distributed workloads in different environments.
| The growing popularity of Golang
Like Kubernetes, Docker, Helm, and Etcd, which are all written in it, Google’s Golang is a programming language that lends itself well to DevOps tooling. It’s a newer language compared to the other options, but it fits well with DevOps goals such as software and application portability, modularity, performance efficiency, and scalability.
Leading brands such as YouTube, Dropbox, Apple, Twitter, and Uber have adopted this cloud-friendly programming language.
With support from Google and ideal for DevOps, the future of Golang looks bright. DevOps teams have either already started using it or are planning to deploy it shortly. In the end, it will help organizations develop highly competitive concurrent applications with accurate results for software development companies.
| The growing importance of cloud-native DevOps
It is possible to ensure better user experiences, better transformation, and innovation management when cloud-native DevOps is adopted. It is the proper use of technology to automatically manage the configuration, monitoring, and deployment of cloud services.
With cloud automation, software can be released faster. Thus, a bright future awaits cloud-based technologies. Oracle believes that 80 percent of enterprise workloads will move to the cloud by 2025.
Moreover, the US Air Force has embraced cloud principles and has agile approaches to developing applications that run on more than one cloud format.
| Growth in the use of service meshes
Organizations can gain several benefits from adopting microservices. Developers use microservices to increase portability, even if it doesn’t make the DevOps team’s job any easier. Operators need to manage large multi-cloud and hybrid deployments.
The emergence of microservices has led to increased use of service meshes, which promise to reduce deployment complexity. A service mesh provides the ability to observe and manage a network of microservices and their interactions, offering a complete view of services. It helps SRE and DevOps teams with complex operational needs such as end-to-end authentication, access control, canary deployment, and A/B testing.
You’ll see an increase in adoption and offerings, as these are critical components of running microservices successfully. The service mesh is the bridge an enterprise must cross when moving from monoliths to microservices.
Conclusion
The DevOps field is constantly growing, and the future bodes well for it. Organizations all over the world are using it because of the many benefits it brings to their businesses. Keep an eye on the latest trends, because as DevOps reaches new heights it can lift your business with it. If you are looking for a software development team, there are many software development companies that can help you grow your business.
Cognitive cloud computing refers to self-learning systems that imitate the human brain with the help of machine learning. It is often described as the third age of computing and works by utilizing big data analytics and deep learning algorithms.
Factors Propelling Market Growth
A cognitive cloud computing system can combine information from different sources using techniques such as natural language processing, data mining, and pattern recognition, and suggest the most suitable strategic approaches for businesses.
These benefits of the technology have driven the adoption of cognitive cloud computing infrastructure across industry verticals including healthcare, BFSI, and retail, and this is one of the major factors behind the growth of the market. Another factor is the growing application of artificial intelligence in cognitive cloud computing. In the healthcare sector especially, cognitive computing combined with AI technologies is being used in oncology to develop suitable medicines. Furthermore, the rising implementation of cognitive cloud computing models in the OTT sector for high-quality video streaming is expected to boost market growth.
Recent Advances in the Market
According to a recent report by Research Dive, the most significant players in the global cognitive cloud computing market include SparkCognition, CognitiveScale, Microsoft, Nuance Communications, Numenta, Cisco, SAP, EXPERT.AI, Hewlett Packard Enterprise Development LP, and IBM. These companies are working towards further growth of the market through strategies such as mergers & acquisitions, partnerships and collaborations, and product launches.
Some of the most recent developments in the market are mentioned below:
According to recent news, Nuance Communications, Inc. has been ranked the No. 1 solutions provider by Black Book Research in five categories, with medical speech recognition and AI technologies among the main ones. The ranking was based on 3,250 survey responses from 203 hospitals and 2,263 physician practices, and it validates Nuance’s understanding of clients’ needs and its farsighted strategy.
As per the latest news, SparkCognition, widely regarded as a leading industrial AI company, announced a collaboration with Cendana Digital, a data science solutions company, on July 9, 2020. The partnership is aimed at expanding SparkCognition’s global presence and bringing advanced AI solutions to the oil and gas market of Malaysia.
In June 2020, SparkCognition and Siemens initiated a new partnership on a cybersecurity system, DeepArmor Industrial, developed with Siemens. The system is designed to protect operational technology (OT), endpoint, and remote assets across the energy value chain by leveraging artificial intelligence (AI) to observe and identify cyberattacks.
This innovative AI-enabled system will provide next-generation antivirus, application control, threat detection, and immediate attack prevention to endpoint power generation, oil and gas, and transmission and distribution assets, bringing fleet-level cybersecurity monitoring and protection capabilities to the energy industry for the first time.
Recent news reveals that Hewlett Packard Enterprise (HPE) is initiating a partnership with Wipro Limited. Through it, the two companies aim to jointly deliver their portfolio of hybrid cloud and infrastructure solutions as a service. The partnership will enable Wipro to leverage HPE GreenLake across its managed services portfolio and offer a pay-per-use model that is agile, elastic, subscription-based, and delivers a consistent cloud experience. Customers benefit by fast-tracking their digital transformation efforts, eliminating the need for upfront capital investment and over-provisioning costs, while retaining the added benefits of security, compliance, and on-premises control.
Impact of COVID-19 on the Industry
The coronavirus outbreak has impacted all industries in some way; however, for the cognitive cloud computing market it has proved to be beneficial. The main driver of this growth is the rising demand for natural language processing (NLP) in healthcare and pharmaceutical organizations, where the technology has been used to support healthcare professionals and scientists during the pandemic. NLP has proved to be an advanced approach to better patient monitoring and care, and, being automated, it allows healthcare staff to manage and monitor patients more effectively. These are the main factors behind the growth of the cognitive cloud computing market during the pandemic.
About Us: Research Dive is a market research firm based in Pune, India. Maintaining the integrity and authenticity of its services, the firm provides services based solely on its exclusive data model, driven by a 360-degree research methodology that guarantees comprehensive and accurate analysis. With access to several paid data resources, a team of expert researchers, and a strict work ethic, the firm offers insights that are precise and reliable. By scrutinizing relevant news releases, government publications, decades of trade data, and technical and white papers, Research Dive delivers the required services to its clients within the required timeframe. Its expertise is focused on examining niche markets, targeting their major driving factors, and spotting threatening hindrances. It also collaborates with major industry aficionados, which gives its research an added edge.
Artificial intelligence is one of the biggest technological waves to have hit the world of technology. According to research from Gartner, artificial intelligence will create business value worth US$3.9 trillion by 2022, and the global artificial intelligence market will grow at a rate of 154 percent. This has resulted in high demand for AI engineers today.
With the growing demand for AI, many individuals are considering it as a career option. In this article, let’s understand the step-by-step process of becoming an artificial intelligence professional.
Step-1: A crucial requirement for anyone seeking a career in the field of AI is to be good with numbers, i.e. to polish their basic math skills. This will help in writing better code.
Step-2: In this step, one must strengthen their roots in the concepts that play a vital part in this field. These are the following concepts:
Linear algebra, probability, and statistics – As mentioned before, mathematics is an integral part of AI. Anyone who wants a growing career in it must have good knowledge of advanced math concepts such as vectors, matrices, statistics, and dimensionality, as well as probability concepts like Bayes’ Theorem.
Programming languages – The most crucial aspect is learning programming languages, as they play a prominent role in AI. One can enroll in an AI engineer certification course to learn them. There are several programming languages; an individual should choose at least one of the following to learn and perfect:
Python
Java
C
R
Data structures – Improve the way you solve problems involving data and analyze data more accurately, so you can develop your own systems with minimum errors. Learn the parts of your programming language that provide data structures such as stacks, linked lists, dictionaries, etc.
Regression – Regression is very helpful for making predictions in real-time applications, so it is important to have a good grasp of its concepts.
Machine learning models – Gain knowledge of the various machine learning concepts, including decision trees, random forests, KNN, SVM, etc. Learn how to implement them by understanding the underlying algorithms, as they are quite helpful in solving problems (a small sketch follows this list).
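To make the regression and model items above concrete, here is a minimal sketch assuming scikit-learn is installed; the tiny datasets are made up purely for illustration.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Toy regression: predict a numeric value from one feature.
X_reg = [[1], [2], [3], [4], [5]]
y_reg = [1.9, 4.1, 6.2, 7.9, 10.1]           # roughly y = 2x
reg = LinearRegression().fit(X_reg, y_reg)
print("Predicted value for x=6:", reg.predict([[6]])[0])

# Toy classification: predict a category (0 or 1) with a decision tree.
X_clf = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_clf = [0, 0, 1, 1]                          # label depends on the first feature
clf = DecisionTreeClassifier().fit(X_clf, y_clf)
print("Predicted class for [1, 0]:", clf.predict([[1, 0]])[0])
```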
Step-3: In this step, aspiring artificial intelligence professionals must learn more in-depth concepts, which form the complex part of AI. If one masters these concepts, one can excel in a career in the field of AI.
Neural networks – A neural network is a computing system modeled on the human brain and nervous system, which learns from data through the algorithm it is built on. Since the concepts of neural networks are the foundation for building AI systems, it is better to have a deep understanding of how they work (a minimal code sketch follows the list of network types below).
There are different kinds of neural networks, which are used in various ways. Some of the common neural networks are:
Perceptron
Multilayer perceptrons
Recurrent neural network
Sequence to sequence model
Convolutional neural network
Feedforward neural network
Modular neural network
Radial basis function neural network
Long Short-Term Memory (LSTM)
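As a taste of what one of these networks looks like in code, here is a minimal sketch of a feedforward network (multilayer perceptron) trained on the classic XOR problem, assuming scikit-learn is available; the data and settings are toy choices.

```python
from sklearn.neural_network import MLPClassifier

# XOR: not linearly separable, so a hidden layer is needed.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# A small multilayer perceptron with one hidden layer of 8 tanh units.
net = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    solver="lbfgs", max_iter=2000, random_state=1)
net.fit(X, y)
print("Predictions:", net.predict(X))   # ideally [0, 1, 1, 0]; can vary with the seed
```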
Domains of AI – After gaining knowledge of the concepts and the different kinds of neural networks, learn about their various applications; this will help in building your own. Every application in the AI field demands a different approach, so start with a specific domain and then proceed to master the other fields of AI.
Big data – Though it is not considered a crucial part of gaining expertise in AI, understanding the basic concepts of big data will be fruitful.
Step-4: This is the last step in the process of becoming an expert AI professional. The following things are required to master the field of AI:
Optimization techniques – Learning how to optimize algorithms helps to minimize (or maximize) the error function, which depends on the model’s internal learnable parameters and plays a key role in the accuracy and efficiency of results. This knowledge lets you apply optimization techniques and algorithms to model parameters in order to reach their optimum values (a gradient descent sketch follows this list).
Publish research papers – One of the best ways to establish credibility in the field of AI is to go a step further: read research papers in the field and publish your own. Start your own research and understand the problems that are still being worked on.
Develop algorithms – After completing the process of learning and research, start working on developing algorithms. You might bring a new revolution with the knowledge you have.
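As promised above, a small worked example of the optimization idea: plain gradient descent adjusting a single learnable parameter to minimize a squared-error function. The data and learning rate are toy values chosen for illustration.

```python
# Fit y = w * x to toy data by minimizing mean squared error with gradient descent.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]        # roughly y = 2x

w = 0.0                          # learnable parameter, started arbitrarily
learning_rate = 0.01

for step in range(1000):
    # Gradient of MSE with respect to w: mean(2 * (w*x - y) * x)
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad    # move against the gradient to reduce the error

print("Learned w:", round(w, 3))  # should be close to 2.0
```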
Conclusion
The aforementioned steps will help an individual sail through the learning path of AI. Undoubtedly, mastering all these skills can be difficult, but it can be achieved with hard work, continuous practice, and consistency.
As a machine learning professional, I have worked for several startups ranging from zero to 600 employees, as well as companies such as eBay, Wells Fargo, Visa and Microsoft. Here I share my experience. A brief summary can be found in my conclusions, at the bottom of this article.
It is not easy to define what a startup is. The first one I worked for was NBCi, a spinoff of CNET, and had 600 employees almost on day one, and nearly half a billion dollars in funding, from GE. The pay was not great especially for San Francisco, I had stock options but the company went the way many startups go: it was shut down after two years when the Internet bubble popped, so I was able to only cash one year worth of salary from my stock options. Still not bad, but a far cry from what most people imagine. I was essentially the only statistician in the company, though they had a big IT team, with many data engineers, and were collecting a lot of data. I quickly learned that my best allies were in the IT department, and I was the bridge between the IT and the marketing department. I was probably the only “neutral” employee who could talk to both departments, as they were at war against each other (my boss was the lead of the marketing department). I also interacted a lot with the sales, product, and finance teams, and executives. I really liked that situation though, and the fact that there was a large turnover, allowing me to work with many new people (thus new connections and friends) on a regular basis, and on many original projects. The drawback: I was the only statistician. It was not an issue for me.
When people think about startups, many think about a company starting from scratch, with 20 employees, and funded with VC money. I also experienced that situation, and again, I was the only statistician (actually chief scientist and co-founder) though we also had a strong IT team. It lasted a few years until the 2008 crash, I had a great salary, and great stock options that never materialized. But they eventually bought one of my patents. I was hired as co-founder because I was (back then) the top expert in my field: click fraud detection, and scoring Internet traffic for advertisers and publishers. Again, I was the only machine learning guy, and not involved with live databases other than to set the rules and analyze the data. And to conceptually design the dashboard platform for our customers. I was interacting with various people from various small teams, occasionally even with clients, and prototyping solutions and working on proofs of concept – some helped us win a few large customers. I was in all the big meetings involving large, new clients, sometimes flying to the client’s location. This is one of the benefits of working as a sole data scientist. Another one, especially if you have specialized, hard-to-find skills (earned by running small businesses on the side), is that I worked remotely, from home.
Yet another startup, the last one I co-founded, structured as an S-corp, had zero employees, no payroll, no funding, no CEO, and no office or headquarters (the official address, needed for tax purposes, was listed as my home address). It had no home-made Internet platform or database: this was inexpensively outsourced. We were working with people in different countries, and our IT team (a one-man operation) was in Eastern Europe. This is the one that was acquired recently by a tech publisher, and my most successful exit. It still continues to grow very nicely today, despite (or thanks to) Covid. It started bare-bones, unlike the other ones, making its survival more likely, with 50% profit margins. However, people working with us were well paid, offered a lot of flexibility, and of course everyone was always working from home. We only met face-to-face when visiting a client. No stock options were ever issued; I made money in a different way. I was interacting mostly with sales, and also contributing content and automatically growing our membership using proprietary techniques of my own that outsmarted all the competitors.
As for the big companies I worked for, I will say this. At Wells Fargo, I was part of a small group (about 100 people) with open office, relatively low hierarchy, and all the feelings of working for a startup. I was told that this was a special Wells Fargo experiment that the company reluctantly tried, in order to hire a different type of talent. It is unusual to be in such a working environment at Wells Fargo. To the contrary, Visa looked more like a big corporation, with many machine learning people each working on very specialized tasks, and a heavier hierarchy. Still I loved the place, and it really helped grow my career. The data sets were quite big, which pleased me. One of the benefits of working for such a company is the career opportunities that it provides. Finally, it is possible to work for a startup within a big company, in what is called a corporate startup. My first example, NBCi, illustrates this concept; in the end I was indirectly working for GE or NBC and even met with the GE auditing team and their six-sigma philosophy. Many of the folks they brought to the company were actually GE and NBC internal employees.
Conclusion
Finding a job at a startup may be easier than applying for positions at big companies. If you have solid expertise, the salary might even be better. Stock options could prove to be elusive. The job is usually more flexible and requires creativity; you might be the only machine learning employee in the company, interacting with various teams and even with clients. Projects can potentially be more varied and interesting, and the environment is usually fast-paced. Working from home is usually an option. You may report directly to the CEO; the hierarchy is typically less heavy. It requires adaptation and may not be a good fit for everyone. You can also work for a startup within a big corporation: it is called a corporate startup. Working for a big company may be a better move for your career, especially if your plan is to work for big companies in the future. Of course, startups also try to attract talent from big companies.
To receive a weekly digest of our new articles, subscribe to our newsletter, here.
About the author: Vincent Granville is a data science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent is also self-publisher at DataShaping.com, and founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target). You can access Vincent’s articles and books, here.
This week was a hallmark for this editor. With a jab of a needle, I entered the ranks of the inoculated, and in the process realized how much the world (and especially the way that we earn a living) has changed.
Many companies are grappling with what to do post-pandemic, even as COVID-19 continues to persist worryingly in the background. Do they bring workers back to the office? Will they come if called? Does it make sense to go fully virtual? To embrace a hybrid model?
Workers too are facing uncertainty, not all of it bad. The job market is tightening, and companies are struggling to find workers. Work from home (or more properly, work from anywhere) has proven to be wildly popular, and in many cases, people are willing to walk away from tens of thousands of dollars a year for the privilege. This has forced companies, already struggling with getting new employees, to reconsider how exactly they should interact with their workforce to a degree unthinkable before the pandemic. Much of this comes down to the fact that AI (at a very broad level) is reducing or even eliminating the need to be in office for most people. Indeed, one of the primary use-cases of AI is to be both vigilant when problems or opportunities arise and to be flexible enough to know who to call when something does go wrong.
On a related front, AIs are increasingly taking over in areas that may be seen as fundamentally creative. Already, generated personalities are becoming brand influencers as GANs become increasingly sophisticated. Similarly, OpenAI’s GPT-3 engine is beginning to replace writers in generating things like product descriptions and press releases. Additionally, robot writers are making their way into generating working code based upon the intent of the “programmer”, a key pillar of the no-code movement.
Finally, robotic process automation is hacking away at the domain of integration, tying together disparate systems with comparatively minimal involvement of human programmers. Given that integration represents upwards of 40% of all software being written at any given time, the impact that this has upon software engineers is beginning to be felt throughout the sector. That this frees up people to deal with less-repetitive tasks is an oft-stated truism, but it also changes people from being busy 24-7 to being available only in a more consultative capacity, with even highly skilled people finding those skills utilized in a more opportunistic fashion.
The nature of work is changing, and those people who are at the forefront of that change are now involved in an elaborate dance to determine a new parity, one where skillsets and the continuing need to acquire them are adequately compensated, and where automation is tempered with the impacts that automation has on the rest of society.
These issues and more are covered in this week’s digest. This is why we run Data Science Central, and why we are expanding its focus to consider the width and breadth of digital transformation in our society. Data Science Central is your community. It is a chance to learn from other practitioners, and a chance to communicate what you know to the data science community overall. I encourage you to submit original articles and to make your name known to the people that are going to be hiring in the coming year. As always let us know what you think.
Data integration is defined as gathering data from multiple sources to create a unified view. The consolidation process gives users consistent, self-service access to their data and a complete picture of key performance indicators (KPIs), customer journeys, market opportunities, and so on.
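As a small illustration of what a unified view can look like in practice, here is a hedged pandas sketch that joins customer records from two hypothetical sources (a CRM export and a billing system); the column names and figures are invented.

```python
import pandas as pd

# Two hypothetical sources describing the same customers.
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ada", "Grace", "Alan"],
    "segment": ["enterprise", "smb", "smb"],
})
billing = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "monthly_spend": [1200.0, 300.0, 450.0],
})

# Integrate on the shared key to create one unified view.
unified = crm.merge(billing, on="customer_id", how="left")
print(unified)

# A simple KPI computed directly from the unified view.
print("Average spend per segment:")
print(unified.groupby("segment")["monthly_spend"].mean())
```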
The following are seven reasons why you need a data integration strategy for your organization.
Keeps up with the evolution of data
Sensors, networking, and cloud storage are all becoming more affordable, resulting in a vast amount of data. AI and machine learning technology can make sense of it all, with capabilities far exceeding those of humans. All that is required is for data from all sources to be combined, and the algorithms will work!
Makes data available
Accessible data is a huge benefit for your business; it’s as simple as that. Imagine that all of your company’s employees, or your business partners, could access centralized data. Making reports and keeping all processes up to date will be easier and much more encouraging for your personnel.
Eliminates security issues
Having access to all forms of continuously updated and synchronized data makes it easier to use AI and machine learning solutions to analyze any suspicious activity and decide how to handle it, or even set up automatic algorithms.
Improves data transparency
With a data integration plan, you can improve all of your interfaces and handle complexity while obtaining maximum results and the best information delivery.
Makes data more valuable
Data integration adds more value to the data. Data quality approaches are becoming more common in DI solutions; these techniques discover and improve data characteristics, making data cleaner, more consistent, and more complete. Because the datasets are aggregated and calculated, they become more useful than raw data.
Simplifies data collaboration
Integrated and available data opens up a whole new universe of possibilities for internal and external collaboration. With data available in the correct format, anyone relying on your reports can have a far more effective impact on the processes.
Fuels smarter business decisions
Using organized repositories with several integrated datasets, you can achieve a remarkable level of transparency and knowledge throughout the entire organization. Nuances and facts that were never accessible before will now be in your hands, allowing you to make the right decisions at the right moment.
The correct data integration methods can translate into insights and innovation for years to come. Consider your needs, your goals, and which type of approach matches both, so you make the best decision for your business.
Among the many decisions you’ll have to make when building a predictive model is whether your business problem is a classification or an approximation task. It’s an important decision because it determines which group of methods you use to create the model: classification (decision trees, Naive Bayes) or approximation (regression trees, linear regression).
This short tutorial will help you make the right decision.
Classification – when to use?
Classification is used when we want to predict which of a limited set of categories an observation belongs to. For example:
Is a particular email spam? Example categories: “SPAM” & “NOT SPAM”
Will a particular client buy a product if offered? Example categories: “YES” & “NO”
What range of success will a particular investment have? Example categories: “Less than 10%”, “10%-20%”, “Over 20%”
Classification – how does it work?
Classification works by looking for certain patterns in similar observations from the past and then tries to find the ones that consistently correspond to membership in a certain category. Suppose, for example, we would like to predict observations:
With a researched variable y with two categorical values, coded blue and red. Empty white dots are unknown – they could be either red or blue.
Using two numeric variables x1 and x2, represented on the horizontal and vertical axes. As seen below, an algorithm was used to calculate a function represented by the black line. Most of the blue dots are under the line and most of the red dots are over the line. This “guess” is not always correct; however, the error is minimized: only 11 dots are “misclassified”.
We can predict that empty white dots over the black line are really red and those under the black line are blue. If new dots (for example future observations) appear, we will be able to guess their color as well.
Of course, this is a very simple example and there can be more complicated patterns to look for among hundreds of variables, all of which is not possible to represent graphically.
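Here is how the same idea looks in a short Python sketch using scikit-learn: a logistic regression learns a separating line from two numeric variables and then predicts the category of new, “white” points. The data is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic past observations: x1, x2 and a known color (0 = blue, 1 = red).
blue = rng.normal(loc=[-1, -1], scale=0.7, size=(50, 2))
red = rng.normal(loc=[1, 1], scale=0.7, size=(50, 2))
X = np.vstack([blue, red])
y = np.array([0] * 50 + [1] * 50)

# The classifier finds a line (decision boundary) separating the two groups.
model = LogisticRegression().fit(X, y)

# "White" dots: new observations whose color we want to predict.
new_points = np.array([[-0.8, -1.2], [1.1, 0.9], [0.1, 0.2]])
print(model.predict(new_points))          # predicted categories
print(model.predict_proba(new_points))    # confidence for each category
```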
Approximation – when to use?
Approximation is used when we want to predict the probable value of a numeric variable for a particular observation. Examples could be:
How much money will my customer spend on a given product in a year?
What will the market price of apartments be?
How often will production machines malfunction each month?
Approximation – how does it work?
Approximation looks for certain patterns in similar observations from the past and tries to find how they impact the value of the researched variable. Suppose, for example, we would like to predict observations:
With a numeric variable y that we want to predict.
With a numerical variable x1 whose value we want to use to predict the first variable.
With a categorical variable x2, with two categories (left and right), that we want to use to predict the first variable.
Blue circles represent known observations with known y, x1, and x2.
Since we can’t plot all three variables on a 2d plot, we split them into two 2d plots. The left plot shows how the combination of variables x1 and x2=left is connected to the variable y. The second shows how the combination of variables x1 and x2=right is connected to the variable y.
The black line represents how our model predicts the relationship between y and x1 for both variants of x2. The orange circles represent new predictions of y for observations where we only know x1 and x2. We put the orange circles in the proper place on the black line to get predicted values for particular observations. Their distribution is similar to that of the blue circles.
As can clearly be seen, the distribution and the pattern of the connection between y and x1 are different for the two categories of x2.
When a new observation arrives, with known x1 and x2, we will be able to make new predictions.
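The same setup translated into a short scikit-learn sketch: a linear model predicts the numeric y from the numeric x1 and the categorical x2 (encoded as 0 for “left” and 1 for “right”, with an interaction term so the slope can differ per category). All data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Known observations: numeric x1, categorical x2 (0 = left, 1 = right), numeric y.
x1 = rng.uniform(0, 10, size=200)
x2 = rng.integers(0, 2, size=200)
# The pattern connecting y and x1 differs for the two categories of x2.
y = np.where(x2 == 0, 2.0 * x1 + 1.0, -1.5 * x1 + 20.0) + rng.normal(0, 1, 200)

X = np.column_stack([x1, x2, x1 * x2])   # interaction term lets the slope differ per category
model = LinearRegression().fit(X, y)

# New observations (the "orange circles"): known x1 and x2, unknown y.
new = np.array([[3.0, 0, 0.0], [3.0, 1, 3.0]])
print(model.predict(new))                # predicted y for x2 = left and x2 = right
```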
Discretization
Even if your target variable is numeric, sometimes it’s better to use classification methods instead of approximation; for instance, if you have mostly zero target values and just a few non-zero values. Change the non-zero values to 1, and you have two categories: 1 (positive value of the target variable) and 0. You can also split a numerical variable into multiple subgroups (for example, apartment prices into low, medium, and high, using equal subset widths) and predict them using classification algorithms. This process is called discretization. A small sketch follows.
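A small sketch of both discretization variants using pandas and numpy (toy values): turning a mostly-zero numeric target into a binary label, and binning prices into three equal-width groups.

```python
import numpy as np
import pandas as pd

# Variant 1: mostly-zero target -> binary categories 0 / 1.
target = np.array([0, 0, 0, 12.5, 0, 3.2, 0, 0])
binary_label = (target > 0).astype(int)
print(binary_label)                      # [0 0 0 1 0 1 0 0]

# Variant 2: numeric prices -> "low" / "medium" / "high" by equal bin width.
prices = pd.Series([120_000, 250_000, 310_000, 450_000, 520_000, 610_000])
price_band = pd.cut(prices, bins=3, labels=["low", "medium", "high"])
print(price_band.tolist())
```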
In this article, I illustrate the concept of asymmetric key cryptography with a simple example. Rather than discussing algorithms such as RSA (still widely used, for instance to set up a secure website), I focus on a system that is easier to understand, based on random permutations. I discuss how to generate these random permutations and compound them, and how to enhance such a system using steganography techniques. I also explain why permutation-based cryptography is not good for public key encryption. In particular, I show how such a system can be reverse-engineered, no matter how sophisticated it is, using cryptanalysis methods. This article also features some nontrivial, interesting asymptotic properties of permutations (usually not taught in math classes) as well as the connection with a specific kind of matrices, using simple English rather than advanced math, so that it can be understood by a wide audience.
1. Description of my public key encryption system
Here x is the original message created by the sender, and y is the encrypted version that the receiver gets. The original message can be described as a sequence of bits (zeros and ones). This is the format in which it is internally encoded on a computer or when traveling through the Internet, be it encrypted or not, as computers only deal with bits (we are not talking about quantum computers or the quantum Internet here, which operate differently).
The general system can be broken down into three main components:
Pre-processing: blurring the message to make it appear like random noise
Encryption via bit-reshuffling
Decryption
We now explain these three steps. Note that the whole system processes information by blocks, each block (say 2048 bits) being processed separately.
1.1. Blurring the message
This step consists of adding random bits at the end of each block (sometimes referred to as padding), then performing a XOR to further randomize the message. The bits to be added consist of zeroes and ones in such a proportion that the resulting, extended block has roughly 50 percent zeroes and 50 percent ones. For instance, if the original block contains 2048 bits, the extended block may contain up to 4096 bits.
Then, use a random string of bits, for instance 4096 binary digits of the square root of two, and do a bitwise XOR (see here) with the 4096 bits obtained in the previous step. The resulting bit string is the input for the next step.
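A hedged Python sketch of this blurring step: pad the block with bits so that zeroes and ones are roughly balanced, then XOR with a reproducible pseudo-random mask (here Python's random module stands in for the binary digits of the square root of two).

```python
import random

def blur(block, padded_len, seed=2):
    """Pad `block` (a list of 0/1 bits) to `padded_len`, balancing zeroes and
    ones, then XOR with a reproducible pseudo-random mask."""
    ones = sum(block)
    zeros = len(block) - ones
    pad = []
    for _ in range(padded_len - len(block)):
        # Choose each padding bit so the extended block stays roughly balanced.
        bit = 1 if ones <= zeros else 0
        pad.append(bit)
        ones, zeros = ones + bit, zeros + (1 - bit)
    extended = block + pad

    rng = random.Random(seed)                      # stand-in for digits of sqrt(2)
    mask = [rng.randint(0, 1) for _ in range(padded_len)]
    return [b ^ m for b, m in zip(extended, mask)], mask

def unblur(blurred, mask, original_len):
    """Reverse the XOR, then drop the padding."""
    return [b ^ m for b, m in zip(blurred, mask)][:original_len]

message = [1, 1, 1, 0, 1, 1, 0, 1]                 # a small 8-bit block
blurred, mask = blur(message, padded_len=16)
assert unblur(blurred, mask, len(message)) == message
```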
1.2. Actual encryption step
The block to be encoded is still denoted as x, though it is assumed to be the input of the previous step discussed in section 1.1, not part of the original message. The encryption step transforms x into y, and the general transformation can be described by

y = K * x
Here * is an associative operator, typically the matrix multiplication or the composition operator between two functions, the latter usually denoted as o, as in (f o g)(x) = f(g(x)). The transforms K and L can be seen as permutation matrices. In our case they are actual permutations whose purpose is to reshuffle the bits of x, but permutations can be represented by matrices. The crucial element here is that L * K = L^n = I (that is, L at power n is the identity operator): this allows us to easily decrypt the message. Indeed, x = L * y. We need to be very careful in our choice of L, so that the smallest n satisfying L^n = I is very large. More on this in section 2. This is related to the mathematical theory of finite groups, but the reader does not need to be familiar with group theory to understand the concept. It is enough to know that permutations can be multiplied (composed), raised to any power, or inverted, just like matrices. More about this can be found here. A toy example follows the key listing below.
That said, the public and private keys are:
Public key: K (this is all the sender needs to know to encrypt the block x as y = K * x)
Private keys: n and L (kept secret by the recipient); the decrypted block is x = L * y
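A toy sketch of this step in Python, using the small permutation L = (5, 4, 1, 2, 3) introduced in section 2.1 below, whose order is n = 6, so the public key is K = L^(n-1) = L^5 and decryption is one extra application of L. Real blocks would of course use thousands of bits.

```python
def apply_perm(perm, bits):
    """Apply a permutation stored as destination positions (1-indexed, as in the
    article): the bit at position i is moved to position perm[i-1]."""
    out = [None] * len(bits)
    for i, dest in enumerate(perm, start=1):
        out[dest - 1] = bits[i - 1]
    return out

def compose(p, q):
    """Permutation 'first q, then p', in the same encoding."""
    return [p[q[i] - 1] for i in range(len(q))]

def perm_power(perm, k):
    """Naive k-fold composition (a faster method is sketched in section 2.1)."""
    result = list(range(1, len(perm) + 1))   # identity permutation
    for _ in range(k):
        result = compose(perm, result)
    return result

L = [5, 4, 1, 2, 3]            # private permutation; its order is n = lcm(3, 2) = 6
n = 6
K = perm_power(L, n - 1)       # public key K = L^(n-1), so that L * K = L^n = I

x = [1, 0, 1, 1, 0]            # a tiny 5-bit block (in practice the output of step 1.1)
y = apply_perm(K, x)           # sender encrypts with the public key
assert apply_perm(L, y) == x   # receiver decrypts with a single application of L
print("encrypted:", y, "decrypted:", apply_perm(L, y))
```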
1.3. Decryption step
I explained how to retrieve the block x in section 1.2 when you actually receive y. Once a block is decrypted, you still need to reverse the step described in section 1.1. This is accomplished by applying to x the same XOR as in section 1.1, then by removing the padding (the extra bits that were added to pre-process the message).
2. About the random permutations
Many algorithms are available to reshuffle the bits of x, see for instance here. Our focus is to explain the simplest one, and to discuss some interesting background about permutations, in order to reverse-engineer our encryption system (see section 3).
2.1. Permutation algebra: basics
Let’s begin with basic definitions. A permutation L of m elements can be represented by an m-dimensional vector. For instance, L = (5, 4, 1, 2, 3) means that the first element of your bitstream is moved to position 5, the second one to position 4, the third one to position 1, and so forth. This can be written as L(1) = 5, L(2) = 4, L(3) = 1, L(4) = 2, and L(5) = 3. Now the square of L is simply L(L), and the n-th power is L(L(L(…))) where L appears n times in that expression. The order of a permutation (see here) is the smallest n such that L^n is the identity permutation.
Each permutation is made up of a number of usually small sub-cycles, themselves treated as sub-permutations. For instance, in our example, L(1) = 5, L(5) = 3, L(3) = 1: this constitutes a sub-cycle of length 3. The other cycle, of length 2, is L(2) = 4, L(4) = 2. To compute the order of a permutation, compute the order of each sub-cycle; the least common multiple of these orders is the order of the permutation. The order of any power of a permutation divides the order of the permutation itself. As a result, if K is a power of L, and L has order n, then both L^n and K^n are the identity permutation. This fact is of crucial importance for reverse-engineering this encryption system.
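A short Python sketch of finding the sub-cycles and the order of a permutation, using the same 1-indexed vector representation as the example above (math.lcm requires Python 3.9+).

```python
from math import lcm

def cycles(perm):
    """Return the sub-cycles of a permutation given as a 1-indexed vector,
    e.g. (5, 4, 1, 2, 3) means 1 -> 5, 2 -> 4, 3 -> 1, 4 -> 2, 5 -> 3."""
    seen, result = set(), []
    for start in range(1, len(perm) + 1):
        if start in seen:
            continue
        cycle, i = [], start
        while i not in seen:
            seen.add(i)
            cycle.append(i)
            i = perm[i - 1]
        result.append(cycle)
    return result

def order(perm):
    """Order = least common multiple of the sub-cycle lengths."""
    return lcm(*(len(c) for c in cycles(perm)))

L = [5, 4, 1, 2, 3]
print(cycles(L))   # [[1, 5, 3], [2, 4]] -- a 3-cycle and a 2-cycle
print(order(L))    # lcm(3, 2) = 6
```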
Finally, the power of a permutation can be computed very fast, using the exponentiation-by-squaring algorithm applied to permutations. Thus even if the order n is very large, it is easy to compute K (the public key). Unfortunately, the same algorithm can be used by a hacker to discover the private key L and the order n (kept secret) of the permutation in question, once she has discovered the sub-cycles of K (which is easy to do, as illustrated in my example). For the average length of a sub-cycle in a random permutation, see this article.
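And a sketch of exponentiation by squaring applied to permutations, which computes L^k with O(log k) compositions instead of k.

```python
def compose(p, q):
    """(p o q)(i) = p(q(i)), for permutations stored as 1-indexed vectors."""
    return [p[q[i] - 1] for i in range(len(q))]

def perm_power(perm, k):
    """Compute perm^k by exponentiation by squaring."""
    result = list(range(1, len(perm) + 1))   # identity
    base = list(perm)
    while k > 0:
        if k & 1:                            # current binary digit of k is 1
            result = compose(base, result)
        base = compose(base, base)           # square
        k >>= 1
    return result

L = [5, 4, 1, 2, 3]
print(perm_power(L, 6))   # identity, since the order of L is 6
```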
2.2. Main asymptotic result
The expected order n of a random permutation of length m (that is, when reshuffling m bits) is
For details, see here. For instance, if m = 4,096 then n is approximately equal to 6 x 10^10. If m = 65,536, then n is approximately equal to 2 x 10^37. It is possible to add many bits all equal to zero to the block being encrypted, to increase its size m and thus n, without increasing too much the size of the encrypted message after compression. However, if used with a public key, this encryption system has a fundamental flaw discussed in section 3, no matter how large n is.
2.3. Random permutations
The easiest way to produce a random permutation of m elements is as follows.
Generate L(1) as a pseudo random integer between 1 and m. If L(1) = 1, repeat until L(1) is different from 1.
Assume that L(1), …, L(k-1) have been generated. Generate L(k) as a pseudo random integer between 1 and m. If L(k) is equal to one of the previous L(1), …, L(k-1), or if it is equal to k, repeat until this is no longer the case.
Stop after generating the last entry, L(m).
I use binary digits of irrational numbers, stored in a large table, to simulate random integers, but there are better (faster) solutions. Also, the Fisher-Yates algorithm (see here) is more efficient; a sketch is shown below.
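For reference, a minimal Fisher-Yates shuffle producing a uniform random permutation in the same 1-indexed vector format, with Python's random module as the source of pseudo-random integers.

```python
import random

def random_permutation(m, seed=None):
    """Fisher-Yates: uniform random permutation of 1..m in O(m) time."""
    rng = random.Random(seed)
    perm = list(range(1, m + 1))
    for i in range(m - 1, 0, -1):
        j = rng.randint(0, i)        # pick a position among the not-yet-fixed ones
        perm[i], perm[j] = perm[j], perm[i]
    return perm

print(random_permutation(10, seed=42))
```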
3. Reverse-engineering the system: cryptanalysis
To reverse-engineer my system, you need to be able to decrypt the encrypted block y knowing only the public key K, but not the private key L nor n. As discussed in section 2, the first step is to identify all the sub-cycles in the permutation K. This is easily done, see the example in section 2.1. Once this is accomplished, compute the orders of these sub-cycle permutations and take the least common multiple of these orders. Again, this is easy to do, and it allows you to retrieve n even though it was kept secret. Now you know that K^n is the identity permutation. Compute K at power n-1, and apply this new permutation to the encrypted block y. Since y = K * x, you get the following:

K^(n-1) * y = K^(n-1) * K * x = K^n * x = x
Now you’ve found x: problem solved. You can compute K at the power n-1 very fast even if n is very large, using the exponentiation-by-squaring algorithm mentioned in section 2.1. Of course you also need to undo the step discussed in section 1.1 to fully decrypt the message, but that is another problem. The goal here was simply to break the step described in section 1.2.
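Putting the attack together, here is a self-contained toy sketch that, knowing only the public key K and an intercepted block y, recovers n from K's sub-cycles and applies K^(n-1) to get x back. The helper functions repeat the ones sketched in section 2; the sizes are tiny for illustration.

```python
from math import lcm

def compose(p, q):
    return [p[q[i] - 1] for i in range(len(q))]

def perm_power(perm, k):
    result, base = list(range(1, len(perm) + 1)), list(perm)
    while k > 0:
        if k & 1:
            result = compose(base, result)
        base, k = compose(base, base), k >> 1
    return result

def apply_perm(perm, bits):
    out = [None] * len(bits)
    for i, dest in enumerate(perm, start=1):
        out[dest - 1] = bits[i - 1]
    return out

def cycle_lengths(perm):
    seen, lengths = set(), []
    for start in range(1, len(perm) + 1):
        if start in seen:
            continue
        length, i = 0, start
        while i not in seen:
            seen.add(i)
            length, i = length + 1, perm[i - 1]
        lengths.append(length)
    return lengths

# Setup (unknown to the attacker): private key L of order n, public key K = L^(n-1).
L = [5, 4, 1, 2, 3]
n_secret = 6
K = perm_power(L, n_secret - 1)

x = [1, 0, 1, 1, 0]              # plaintext block
y = apply_perm(K, x)             # what the attacker intercepts

# Attack: only K and y are used from here on.
n = lcm(*cycle_lengths(K))       # K^n is the identity
recovered = apply_perm(perm_power(K, n - 1), y)
assert recovered == x
print("recovered block:", recovered)
```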
With stagnating consumption and restrictions on individual and merchandise flow, the footwear and apparel industry is quickly overhauling its strategy for the “new normal”. COVID-19 is forcing the industry to be more focused, sustainable, and efficient in its processes. In doing so, it is also forcing the industry to become more digitally savvy. The challenge before industry leaders is to be more “connected” across the value chain while being “physically disconnected”. How will the industry’s operations look a few months from now as they are re-shaped by the crisis?
If anything is certain, it is this: the industry will adopt additional digital assets to drive product development, adopt technology solutions that connect processes and move business towards more sustainable and resilient processes.
The PLM solutions that the footwear and apparel industry uses were not designed for an eventuality like COVID-19. For example, there is an over-reliance on physical samples, which right now are difficult to make and ship. This has presented a major challenge in the development process. Add to that the problem of getting multiple iterations through the sampling process. For an industry where “physical review” and “in-use” evaluation of the performance characteristics of materials are of immense importance, this can be very problematic. The challenge is to ensure that stakeholders are connected and able to meaningfully perform their activities (in the case of samples) in a collaborative environment, while working around the lockdown and lack of co-location imposed by COVID-19.
Brooks Running, the brand behind high-performance running shoes, clothing, and accessories headquartered in Seattle, USA, is finding ways to address the challenges. “We are now dipping our toes in 3D design and modelling,” says Cheryl Buck, Senior Manager, IT Business Systems, Brooks Running, who is responsible for the FlexPLM implementation in her organization. “We are examining visual ways to create new products and ensuring extremely tight communication (between functions).”
The goal is to improve collaboration and build end-to-end visibility coupled with accurate product development information (schedules, plans, resources, designs, testing, approvals, etc.) that reduces re-work and delivers faster decisions. The key to enabling this is to make real-time data and intuitive platforms with a modern UI available to all stakeholders. The result is improved productivity and efficiency, despite the barriers COVID-19 has placed on business as we knew it. Smart manufacturers like Brooks Running are going one step further: they are using role-based applications to extend the capabilities of their PLM (see figure below). These apps cater to the needs of individual users.
(App segmentation is illustrative, two or more segments can be combined to create apps to support business needs)
Brooks Running has found that the app ecosystem meets several process expectations. “It is difficult to configure PLM to individual needs,” observes Buck, “So providing special functional apps is a great way to address the gaps.” A readily available API framework allows Brooks Running to exchange information between systems and applications. The framework is flexible to support scale and is also cloud compatible.
For Brooks Running, COVID-19 has become a trigger for creativity and for streamlining its processes. The organization has installed high-resolution cameras to give a better view of products and materials shared in video conferences, and is incorporating 3D modelling into its processes. This makes it easier for stakeholders to assess designs because they can see more detail (testing has been moved to locations that are less affected by COVID-19, and when that is impossible, Brooks Running uses virtual fitting applications). The end result is that teams can easily share product development work, gather feedback, and move ahead. The system keeps information on plans and schedules updated, shares status and decisions, and keeps things transparent.
Brooks Running finds that a bespoke Style Tracker application designed for the organization is proving to be extremely helpful in that it provides a reference point for styles in their development path within the season, tracks due dates for stakeholders, signals what needs to be done to get to the next check point and provides a simple way for leadership to track progress. “The style tracking app is a big win for us,” says Buck.
The experience of Brooks Running provides footwear and apparel retailers with a new perspective on the possibility to improve PLM outcomes and ROI:
Provide user group/process specific and UI rich tools
Enable actionable insights
Leverage single or multiple source of data
Enable option for access across mobile platforms
Make future upgrades easy and economical
Provide an easy way for incorporating latest technology platforms like 3D, Artificial Intelligence, RPA, Augmented Reality/Virtual Reality
FlexPLM is a market leader in the retail PLM space. Combining it with ITC Infotech’s layer of apps gives FlexPLM partners an easy, efficient, and scalable way to align the solution with the needs and expectations of each business, with minimal impact on future upgrades, which is a significant plus. In addition, the app framework makes alignment across different configurations easy, which is an added advantage. Organizations that don’t want to modify their Flex implementation will find immense appeal in the app ecosystem as a way to extend the capabilities of their PLM, especially as COVID-19 makes it necessary to bring innovation and ingenuity to the forefront.
Authors:
Cheryl Buck
Senior Manager, IT Business Systems, Brooks Running