Data Warehouse, Data Lake, Data Mart, Data Hub: A Definition of Terms

In today’s business environment, most organizations are overwhelmed with data and looking for a way to tame the overload, make it more manageable, and help team members gather and analyze the information contained within the enterprise. When a business enters the domain of data management, it can easily get lost in a morass of terms and concepts and find it nearly impossible to sort through the confusion. Without a clear understanding of the various categories of data management options, the business may make the wrong choice, or become so mired in the review process that it gives up its quest.

 

This article is the first of two on the topic of Data Management. Here, we define the various terms so that a business can more easily understand the types of data management solutions and tools. In the second article, entitled ‘Factors and Considerations Involved in Choosing a Data Management Solution’, we discuss the factors and considerations a business should weigh when it is ready to choose a data management solution.

 

Data Warehouse

A Data Warehouse (AKA Datawarehouse, DWH, Enterprise Data Warehouse or EDW) solution is designed to centralize and consolidate large bodies of data from multiple, disparate sources and is meant to help users execute queries, perform analytics, produce reporting, and obtain business intelligence. Data Warehouse data typically comprises data from applications, log files and historical transactions, and integrates and stores data from relational databases and other data sources originating in various business units and operational entities within the enterprise, e.g., sales, marketing, HR, finance.

 

A Data Warehouse is a structured environment that is comprised of one or more databases and organized in tiers. An interactive, front-end tier provides search results for reporting, analytics and data mining. The search engine accesses and analyzes the data for presentation and the foundational architecture or database server provides the storage and loading repository.

To prepare data for analysis, a Data Warehouse environment will typically utilize an Extraction, Transformation and Loading (ETL) process. Team members who access a Data Warehouse may use SQL queries, analytical solutions or BI tools to mine, report, visualize, analyze and present the data.
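To make the ETL idea concrete, here is a minimal, hedged sketch in Python using pandas; the file names, column names and the SQLite ‘warehouse’ table are invented purely for illustration.

import pandas as pd
import sqlite3

# Extract: pull raw records from two assumed sources (a sales export and a CRM export)
sales = pd.read_csv('sales_export.csv')          # hypothetical sales extract
customers = pd.read_csv('crm_customers.csv')     # hypothetical CRM extract

# Transform: clean types, join the sources, and derive a reporting-friendly column
sales['order_date'] = pd.to_datetime(sales['order_date'])
merged = sales.merge(customers, on='customer_id', how='left')
merged['revenue'] = merged['quantity'] * merged['unit_price']

# Load: write the consolidated result into the warehouse (here, a local SQLite table as a stand-in)
with sqlite3.connect('warehouse.db') as conn:
    merged.to_sql('fact_sales', conn, if_exists='replace', index=False)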

 

Data Mart

We can think of a Data Mart as a subset of a Data Warehouse but, whereas a Data Warehouse is an enterprise-wide solution that comprises data from across the organization, the Data Mart is a structured environment that is used to store and present data for a specific team or business unit. This approach allows a business team or unit to curate, leverage and manipulate data that is specific to their teams. For example, a business might create a Data Mart to serve its Marketing, Sales and Advertising teams or it might expand that use to include Customer Service and Product teams so that it can more easily analyze and collaborate using data culled from specific sources within these business units.

 

While Data Warehouses access and analyze large volumes of records, a Data Mart improves the response time and performance for end-users by refining the data to provide only data that will support the collective needs of a specified group of users.

 

Think of a Data Mart as a ‘subject’ or ‘concept’ oriented data repository. A Data Mart often provides a subset of data from a larger Data Warehouse and is designed for ease of consumption, to produce actionable insight and analysis for a particular group.

 

Data Lake

A Data Lake is a less structured and more flexible approach to data management with data streaming in from various sources and a more free-wheeling approach to data access, exploration and sampling. A Data Lake stores data with no organization or hierarchy. All data types are stored in raw form or semi-transformed format and data is only organized for presentation and use as queries or requests are generated.

 

A Data Lake can store structured (relational databases, rows and columns), semi-structured (XML, JSON, logs, CSV) and unstructured or binary (Word documents, PDFs, images, email, audio or video) data. It acts as a repository for various data sources, and users can draw on that data for many types of analytics, from visualization and dashboard presentation to machine learning and data processing.

 

Data Hub

A Data Hub solution is typically a more flexible, personalized approach to data management with various integration technologies and solutions overlaid to provide the structure or output needed by the business. The data flows from various sources – not all of which will be operational. A Data Hub can provide data in various formats and perform actions to refine data for quality, security, duplicate removal, aged data, etc.

 

The Data Hub is meant to collect and connect data to produce insight for collaboration and data sharing. It will act as an integration and data processing hub to connect data sources and make them more readily accessible and usable for team members. The definition of a Data Hub will vary by business use and by organization as the parameters and organization of the hub environment will flex to the needs of the organization. So, factors like available models, data governance and access, data persistence and analytical formats and reporting options will vary.

 

As you consider the various solutions and options for data management, be sure to develop and use a comprehensive and detailed set of requirements and elicit feedback from those who will use and manage the solution.

 

Now that you understand the various Data Management options, you are ready to select an option for your business. The second of our two-article series, entitled, ‘Factors and Considerations Involved in Choosing a Data Management Solution’ will provide some simple suggestions and recommendations to help you choose the right option.

Your Guide for the Commonly Used Machine Learning Algorithms

We are living in a period in which computing has transformed immensely, from large mainframes to personal computers to the cloud. This constant technological progress and the evolution of computing have resulted in major automation.

In this article, let’s look at a few commonly used machine learning algorithms. These can be used for almost any kind of data problem.

  1. Linear Regression
  2. Logistic Regression
  3. Decision Tree
  4. SVM algorithm
  5. Dimensionality Reduction Algorithms
  6. Gradient Boosting algorithms and AdaBoosting Algorithm
  • GBM
  • XGBoost
  • LightGBM
  • CatBoost
  1. Linear Regression

This is used to estimate real values such as the cost of houses, number of calls, or total sales based on a continuous variable. In this process, a relationship is established between the independent and dependent variables by fitting the best line. This best fit line is called the regression line and is represented by the linear equation Y = a*X + b.

In this equation:

  • Y – Dependent Variable
  • a – Slope
  • X – Independent variable
  • b – Intercept

The coefficients a and b are derived by minimizing the sum of squared differences between the data points and the regression line.
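As a minimal illustration of fitting such a line, here is a hedged sketch using scikit-learn; the house-size/price numbers are invented purely to show the API.

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: house size (X, in square feet) vs. price (y, in $1000s)
X = np.array([[800], [1000], [1200], [1500], [1800]])
y = np.array([150, 180, 210, 260, 300])

model = LinearRegression().fit(X, y)

# a (slope) and b (intercept) are chosen to minimize the sum of squared residuals
print('slope a:', model.coef_[0])
print('intercept b:', model.intercept_)
print('prediction for 1300 sq ft:', model.predict([[1300]])[0])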

  2. Logistic Regression

This is used to estimate discrete values (mainly binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s). In simple terms, it predicts the probability of an event occurring by fitting data to a logit function. It is also called logit regression.
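As a minimal, hedged sketch of logit regression with scikit-learn (the pass/fail data below is invented just to show the fit/predict_proba flow):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: hours studied (X) vs. pass/fail outcome (y = 1/0)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# The model returns the probability of the event (passing) via the logit function
print(clf.predict_proba([[3.5]])[0, 1])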

The following can be tried to improve a logistic regression model:

  • including interaction terms
  • removing features
  • regularization techniques
  • using a non-linear model
  3. Decision Tree

This is widely used for classification problems and is considered among the most popular machine learning algorithms. It works for both continuous and categorical dependent variables. The tree splits the population on the most significant attributes/independent variables to create groups that are as distinct as possible.
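A short, hedged example using scikit-learn's DecisionTreeClassifier on the bundled Iris dataset; the max_depth value is an arbitrary choice for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The tree repeatedly splits on the attribute that best separates the classes
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print('test accuracy:', tree.score(X_test, y_test))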

  4. SVM algorithm

This algorithm is a classification method in which the raw data is plotted as points in n-dimensional space (where n is the number of features present). The value of each feature becomes the value of a particular coordinate, which makes it quite easy to classify the data. For instance, consider two features such as the hair length and height of an individual. These two variables are plotted in two-dimensional space, where every point has two coordinates; the points closest to the separating boundary are called Support Vectors.
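Continuing the hair length / height illustration, here is a hedged sketch with scikit-learn's SVC; the two-feature data points below are invented.

import numpy as np
from sklearn.svm import SVC

# Invented 2-D points: [hair length (cm), height (cm)] with binary class labels
X = np.array([[5, 180], [7, 175], [10, 178], [30, 160], [35, 165], [40, 158]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear').fit(X, y)

# The points closest to the separating boundary are the support vectors
print('support vectors:', clf.support_vectors_)
print('prediction for [20, 170]:', clf.predict([[20, 170]]))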

  5. Dimensionality Reduction Algorithms

Over the last few years, huge amounts of data have been captured at every possible stage and analyzed across many sectors. Raw data often contains many features, and the major challenge is identifying the most significant variables and patterns. Dimensionality reduction algorithms, along with techniques such as Decision Trees, PCA, and Factor Analysis, help find the relevant details using criteria such as the correlation matrix or the missing value ratio.
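For instance, a brief, hedged sketch of PCA with scikit-learn, projecting synthetic high-dimensional data down to a handful of components:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # synthetic data: 200 samples, 50 features

pca = PCA(n_components=5).fit(X)
X_reduced = pca.transform(X)            # same samples, only 5 derived features

print(X_reduced.shape)                  # (200, 5)
print(pca.explained_variance_ratio_)    # variance captured by each component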

  6. Gradient Boosting algorithms and AdaBoosting Algorithm

GBM – These are boosting algorithms that are widely used when large volumes of data have to be handled to make predictions with high accuracy. AdaBoost is an ensemble learning algorithm that combines the predictive power of several base estimators to improve robustness.

XGBoost – This has very high predictive power, which makes it a strong choice when accuracy matters, as it supports both tree learning algorithms and linear models.

LightGBM – This is a gradient boosting framework that uses tree-based learning algorithms. It is a very fast, highly efficient framework built on decision tree algorithms and designed to be distributed, with the following benefits:

  • Parallel and GPU learning supported
  • Quicker training speed and better efficiency
  • Lower memory usage and enhanced accuracy
  • Capable of handling large-scale data

CatBoost – This is an open-source machine learning algorithm. It can easily integrate with deep learning frameworks such as Core ML and TensorFlow, and it can work with various data formats.
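As a hedged, minimal example of the boosting family, here is scikit-learn's GradientBoostingClassifier on the bundled breast cancer dataset; the hyperparameters are arbitrary illustrations, and XGBoost, LightGBM and CatBoost expose a very similar fit/predict pattern.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Boosting fits many shallow trees in sequence, each correcting the previous ones' errors
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3, random_state=42)
gbm.fit(X_train, y_train)
print('test accuracy:', gbm.score(X_test, y_test))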

Anyone seeking a career in machine learning should understand these algorithms and keep deepening their knowledge of them.

Why Enterprise Data Planning Is Crucial for Faster Outcomes

Have you ever sat in a meeting where everyone has a different number for the same performance measure? This typically results in spending the next hour trying to reconcile the differences rather than making the important business decisions required. Upon further analysis, it is likely everyone will have the right number according to the system from which it was derived.

The differences can likely be attributed to inconsistent hierarchical master data across these systems. The problem has existed ever since organizations started implementing more than one business system. But today, it is magnified across the many systems most organizations run and by the large number of changes today’s business environment generates. It is therefore essential for organizations to effectively manage hierarchical master data across multiple information systems. Organizations need to move beyond the mix of email, spreadsheets and ad hoc systems that many currently rely on to execute this extremely important function. Numerous organizations are looking for enterprise software solutions to help them manage these problems without relying on manual processes.

Why is Enterprise Data Planning Important? 

Data is usually shared across many enterprise systems. For example: John (Sales Representative) who works in California (Territory) sells 10,000 (Quantity) of a new widget (Product) to a customer (Customer) based in New York (Geography) for $50,000 (Total Sale) on December 15, 2017 (Date). Taken together, this information is about one transaction, but included in the transaction are individual elements of master data—Sales Representative, Territory, Quantity, Product, Customer, Geography, Total Sale and Date.

Today’s Enterprise Data Planning Challenges

How do most enterprises manage enterprise data today? Remarkably for something so important, they do it through conversations, telephone calls, spreadsheets and e-mail. For example, if a departmental manager wants to add another cost center, or if management wants to move facilities from human resources to finance, the business decision must first be approved by all the relevant decision makers. This takes time. Once the change is approved, IT receives the request to make the change and ensure that it ripples through all of the enterprise’s transactional systems, data warehouses, business intelligence and enterprise performance management solutions. Because changes are made manually, often the end result is a lot of people making a lot of mistakes with a lot of mission critical data—mistakes that go undiscovered due to a lack of visibility or traceability in the process. This is compounded by the sheer number of changes that take place in enterprises today. We constantly cite the increasing rate of change in business which inevitably leads to increasing change in enterprise data.

Modern Enterprise Data Management 

World-class performers experience significant benefits from taking a modern, agile approach to enterprise data management across their entire business systems landscape. Key characteristics of this approach include: 

• Eliminating the need for a formal, upfront data governance program and initiative that requires burdensome commitments including executive sponsorship, agreement on terms and definitions, enterprise policies, and a host of other coordination costs between Business and IT to orchestrate people, time and resources across lines of businesses, divisions or geographies. 

• Taking an elastic approach to managing enterprise data that is evolutionary, iterative, incremental and flexible. One that does not force mastering to achieve desired outcomes, but is fit for purpose based upon desired scope: peer-to-peer within a small workgroup, application-to-application to support local alignment, or enterprise-wide to enable global mastering initiatives as desired based upon the aspirations, capabilities, and maturity of an organization at a point in time. 

• Facilitating easy-to-use, web-based, self-service experiences for streamlined application maintenance, collaborative change management, faster data sharing, and accelerated new application development.

 • Utilizing a request-driven approach to all change management and data hygiene activities in an easy-to-use, self-service experience that promotes timely, accurate changes across a spectrum of business users.

 • Employing a business-driven approach to snapshot historical versions, branch off production data sets to explore what-if scenarios, and merge approved plans into production in a timely manner to drive value among connected business applications. 

• Comparing alternate business perspectives within and across applications to understand differences, and rationalize on a fit-for-purpose basis. 

• Streamlining last mile integration with connected business applications, across public, private, and hybrid cloud environments. 

• Having fully transparent activity trails that enable regulatory compliance and risk mitigation.

PromptCloud can help you in aggregating the data from the web using advanced scraping techniques. Enjoy 100% quality data at the frequency of your choice.

Data Science: A Good Career Choice

Data science can be considered to be a new buzzword in the world of technology. Data scientists and big data engineers hold the promise of high pay and excellent job growth. In order to explore this beautiful world of data science, it is essential to know:   

  • What is data science?  
  • Is data science a good career option?
  1. What is data science?  

Data scientists research where information comes from, how data points fit together to tell a story, what the patterns represent, and how they help explain business results.

On a daily basis, data science means building statistical models for projections, econometrics, classification, clustering, simulations and other objectives; predicting user behavior through pattern or trend identification; carrying out thorough data analysis; conveying insights through data visualizations; and summarizing data.

  2. Is data science a good career option?

Data science has experienced 650% growth in jobs since 2012. The field is seeing a huge rise in demand, which makes it a good career choice.

Data science can be considered an intellectually demanding area of study and work. Much time is required to clean data, import large datasets, build databases, and maintain dashboards. To stand tall as a data scientist, you need to enjoy quantitative work and care about enabling firms to make data-driven decisions.

According to LinkedIn, SQL is the most requested skill for data science jobs. In addition, Hadoop and Spark are rising in popularity. You will be required to learn a programming language such as Python, SAS or R. You should also brush up on your mathematical skills, with a focus on probability and statistics, multivariate calculus and linear algebra, and learn data visualization tools such as Tableau.

It is also advisable to learn to code carefully as a beginner, as even a change in a parameter can change results, and there is very little margin for mistakes. As you advance in your career as a data scientist, you may choose to specialize in machine learning algorithms, natural language processing, deep learning, or many other related areas grounded in unstructured data and big data.

To be successful as a data scientist, you should also possess soft skills like storytelling, teamwork, interpersonal skills, and communication skills. These skills often cannot be grasped from textbooks; they develop on the job, in collaboration with the tech teams and stakeholders across product, business, and other functions.

Benefits of studying and practicing data science  

The main advantages of data science are given below:

  1. Data science is in high demand  

According to LinkedIn, it is the fastest-growing career, and 11.5 million data science jobs are expected to be created by 2026. This shows that it is a high-demand job area.

In addition to the above information, it can be stated that only a very few people have the skills that make up a data scientist. This turns data science into a less saturated area as compared to other information technology areas. Data science is an exceedingly abundant area of work with tons of opportunities. This is so also due to the low supply of data scientists.  

  2. Career and payment in data science

If you pursue a career in data science, you will be eligible for highly paid positions. According to Glassdoor, data scientists earn on average USD 116,100 per annum, making it a very lucrative role.

  3. Data science is a versatile field

There are a huge number of applications of data science. It is used mainly in banking, healthcare, e-commerce and consultancy services. The applicability of data science is thus versatile, and it will allow you to work in several fields.

  4. Data science turns data to a better state

Firms require skilled data scientists to process and analyze their data. Data scientists both improve the quality of the data and analyze it, enriching it and making it more useful for the company.

This is also the reason why data scientists are so crucial for the firm. They make better business decisions. The firms are dependent on them and use their expertise to offer better solutions to the clients. This is the reason why they hold a special position in the company.  

  5. Freedom from boring tasks

Data scientists have enabled several industries to automate redundant tasks. Firms use historical data to train machines to perform repetitive activities, which has simplified the arduous jobs previously done by people.

Conclusion   

You can get certified online as a Data Engineer or Data Scientist from a highly recommended platform: dasca.org. Your career will be on a definite rise with these globally valued certificates.

 

Unleashing the Business Value of Technology Part 4: Delivering Value

We are now ready to wrap up the four-part series on how technology vendors – especially data and analytic technology vendors (and what technology vendors are not involved in data and analytics nowadays) – can help their customers to “unleash the business value of their technology investments.”

I wrote this four-part series because in my journeys these past few months, both technology customers and technology vendors bemoaned their frustrations with deriving business value from their technology investments.  And senior IT leadership in particular were coming under increased scrutiny about delivering on the promise of these technology investments.

In Part 1, I introduced the 3 stages of “unleashing the business value” roadmap:  1) Connecting to Value, 2) Envisioning Value and 3) Delivering Value (see Figure 1).

Figure 1: Unleashing the Value of Technology Roadmap

In Part 2, I provided some techniques that enable technology vendors to connect to “value”, but that is “value” as defined by the customer’s business initiatives, not “value” as defined by the technology vendor’s product or services capabilities.  I introduced two fundamental techniques for vendors who want to connect with their customers’ sources of value creation:

  • Step 1: Understanding how your customers make money.  This requires investing the time upfront to research a customer’s key business initiatives and their supporting business and operational use cases.
  • Step 2: Triaging Your customer’s business initiatives.  I introduced the Big Data Strategy Document for helping technology vendors understand where and how data and analytics might support their customers’ key business initiatives.

In Part 3, I reviewed some techniques to help customers “envision” the realm of what’s possible with respect to how data and analytics can derive and drive new sources of customer, product, and operational value.

Now in Part 4, it’s time to bring home the bacon!

De-risk the Path Forward

Part 4 is where the rubber (and vendor commitment) hits the road.  Part 4 provides an opportunity for technology vendors to create a co-creation relationship with their customers to help them unleash the value of their technology investments.

The 3-stage customer engagement model depicted in Figure 2 de-risks the customer’s decision to move forward by putting a large majority of the onus of success on the technology vendor.  This approach provides the technology vendor with the opportunity to prove that they can deliver on the business potential of their customers’ technology investments (see Figure 2).

Figure 2: De-risk the Customer Path Forward

Stage 1: Envisioning Workshop builds off our earlier work in “connecting to value” and “envisioning value” to assure the customer that the solution and outcomes being proposed are relevant and meaningful vis-à-vis their business objectives.  Stage 1 is typically a low-cost, 2 to 3-week collaborative engagement with the key business stakeholders to identify, validate, value, and prioritize the use cases where data and analytics can deliver material business value.

Stage 2: Proof of Value (POV), sometimes called the Minimal Viable Product (MVP), focuses on proving the business value of the prioritized use case identified in Stage 1. This 4 to 6-month co-creation engagement requires close collaboration between the vendor and the customer to quantify the realized business value while building confidence in the vendor’s ability to deliver said solution.  While many different technology tools will likely be used in Stage 2, the primary focus of Stage 2 is to prove the vendor’s ability to deliver quantifiable, relevant, and actionable business and operational outcomes.

Stage 3 builds upon the business and operational learnings from Stage 1 and reuses the technology assets co-created in Stage 2 to integrate those assets into the customer’s operational and management systems.  Having proven in Stage 2 that the vendor can deliver on the business potential, scaling (and governance) becomes a key focus in Stage 3, as the underlying technology architecture and infrastructure facilitates the on-going delivery of business outcomes across the different customer use cases.

Establishing the Data Monetization Governance Board

Success breeds success, and after the successful execution of stage 3, more and more business units will also want the opportunity to unleash the business value of their technology investments to deliver meaningful, relevant and actionable business and operational outcomes.  To properly manage the sudden demand – and to avoid the data silos and orphaned analytics that doom the long-term economic value of data and analytics (yea, I wrote the book “The Economics of Data, Analytics, and Digital Transformation” that talks all about that), the customer will need help in establishing a Data Monetization Governance Board.

The Data Monetization Governance Board champions and enforces data and analytic monetization best practices across the organization. The Data Monetization Board has both the responsibility and the authority (the carrot and the stick) to enforce the sharing, reuse and continuous refinement of the organization’s data and analytic assets. Otherwise data monetization will continue to be a haphazard effort with disappointing results.

As I discuss in the blog “Digital Transformation Requires Redefining Role of Data Governance” the Data Monetization Governance Board must:

  • Evangelize a compelling vision to the business executives regarding the economic potential of data and analytic assets to power an organization’s digital transformation.
  • Educate senior executives, business stakeholders and strategic customers on how to “Think Like a Data Scientist” in identifying and envisioning where and how data and analytics can deliver quantifiable and relevant business value.
  • Apply Design Thinking and Value Engineering concepts in collaborating with business stakeholders to identify, validate, value and prioritize the organization’s high-value use cases that will drive the organization’s data and analytics development roadmap.
  • Charter the Data Science team to “engineer” reusable, continuously-learning and adapting analytic assets that support the organization’s high priority use cases.
  • Develop an analytics culture that synergizes the AI / ML model-to-human collaboration that empowers teams at the point of customer engagement and operational execution.

If data is the new oil – the catalyst for the economic growth in the 21st century – then the Data Monetization Governance Board may very well be the most important structure in the modern organization.

Unleashing the Business Value of Technology End Goal

Hopefully this 4-blog series can help technology vendors unleash the business value of their customers’ technology investments.  This outcomes-driven process starts by (1) connecting with the customers’ sources of value creation, then (2) helping the customer to “envision the possibilities” in leveraging data and analytics to drive business outcomes, and finally (3) providing an iterative delivery model to de-risk the customer’s path forward (see Figure 3).

Figure 3: Unleashing the Business Value of Technology Roadmap

In the end, if the technology vendor can’t help customers unleash the business value of their technology investments, then they are in the wrong business (see Figure 4).

Figure 4: Unleashing the Business Value of Technology

It should be like printing money…for your customers…

The early move advantage in AI:  leaders vs laggards vs aspirants

Are you a leader or a laggard or an aspirant in AI?

This is a subject close to my heart because I focus my teaching, research and consulting towards the leader end of AI – where the competitive advantages create exponential gains for companies and people.

There is a great article from McKinsey called Tipping the Scale of AI: How leaders capture exponential returns, which makes this point eloquently.

Here are the key takeaways from this article that resonate with me.

If you want to be an AI leader, you should be paying attention to this.

Also, if you are working for a company, you should try to see whether it aspires to be a leader or an also-ran.

I think in a decade, just like we are seeing with the retail industry, many of these also-rans who do not invest in AI for competitive advantage will not exist.  

  • Where many companies tire of marginal gains from early AI efforts, the most successful recognize that the real breakthroughs in AI learning and scale come from persisting through the arduous phases.
  • Key lessons from AI leaders: Fund aggressively when conditions for success are in place; Build density in domains; Bring a rounded set of skills and invest in productivity; Speed execution with iterative releases; Win the front line
  • Many organizations underestimate what it takes to sow true gains, be it selecting the right seeds, apportioning the right investment, or having a mindset willing to put up with the vagaries of the crop cycle.
  • But for those that persevere, the rewards can be huge.
  • McKinsey research finds that leading organizations that approach the AI journey in the right ways and stick with it through the tough patches generate three to four times higher returns from their investments.
  • These AI leaders get on a different performance trajectory from the outset because they understand that AI is about mastering the long haul.
  • They prepare for that journey by anticipating the types of things that will make it easier to navigate the ups and downs, such as feedback loops that allow data quality and user adoption to compound and AI investments to become self-boosting.
  • Where some companies tire of marginal gains from weeks of effort, leaders recognize that the real breakthroughs in AI learning and scale come from working through those small steps.
  • But only a small number of businesses (10%) have figured out how to make AI work in these ways. The rest remain mired in the low to middling stages of maturity, with laggards making up 60 percent of the population and aspirants 30 percent.
  • Top performers recognize that most of the impact comes from the last 20% of the journey
  • Leaders get disproportional impact from their AI investments.
  • The window of opportunity for underperformers is closing.
  • Rather than dabbling in lots of different areas, they build strength and density in one or two domains, then expand from there. That approach allows them to deepen their use and application of unstructured data, access more sophisticated use cases, and layer in the necessary operational underpinnings—the investment, talent, data management, production, and other techniques that allow AI-enabled practices to become embedded into everyday routines
  • Moreover, as leaders build domain strength and reach a certain threshold in AI performance, their rate of learning and productivity increases, allowing them to progress through other domains faster and tackle problems of ever-increasing difficulty.
  • They recognize that scaling AI solutions to deal with increasingly sophisticated problems is hard, but necessary to capture value. Teaching a machine to identify human faces is one milestone, for instance, but getting the machine to recognize particular faces and only those faces is a far more complex undertaking.
  • Once solved, companies gain compounding benefits quickly.

 

While some of this can be seen as consultants prodding companies to action, to many of us, none of this is new. We have already seen how companies like Amazon are reaping exponential gains due to their investment in technology and AI.

There is an early mover advantage in AI and companies who aspire to take a leadership position will gain exponential benefits in comparison to the laggards and the aspirants

Image source pixabay ninikvaratskhelia

 

Applications of real-world reinforcement learning

This blog is the second part of a two-blog series. Here, we discuss different sectors where reinforcement learning can be used to solve complex problems efficiently. The blog is based on this paper. In the previous post, we studied the basics of reinforcement learning and how one can think of a problem as a Reinforcement Learning problem. In this follow-on post, we look at how real-world reinforcement learning applications can be developed. 

In general, to formulate any reinforcement learning problem, we need to define the environment, the agent, states, actions, and rewards. This idea forms the basis for the examples in this post. 

We cover reinforcement learning for – 

  1. Recommender systems
  2. Energy management
  3. Finance
  4. Transportation
  5. Healthcare

Recommender systems

Algorithms for recommendation systems are constantly evolving, and reinforcement learning techniques play a key part in recommendation algorithms. Recommender systems face some unique challenges which can be addressed using reinforcement learning techniques. These challenges are:

  • The idiosyncratic nature of actions
  • A high degree of unobservability
  • Stochasticity in preferences, activities, personality, etc.

Horizon is Facebook’s open source applied reinforcement learning platform for recommendations. We formulate the illustrations in the figure below in terms of reinforcement learning as – 

  • Action is sending or dropping of a notification.
  • Agent is the recommender system
  • Rewards are the interactions and activities on Facebook.
  • Environment is the user and the news.
  • State space is ongoing interaction and engagement of the user and the news along with the features representing the candidate to be notified.


Deep RL news recommender system. (from Zheng et al. (2018))

Energy

Cooling is an essential process for data centers in order to lower high temperatures and conserve energy, and reinforcement learning can be used efficiently for data center cooling. Typically, Model Predictive Control (MPC) is used to monitor or regulate the temperature and airflow for components in the data center, such as fan speeds, water flow regulators, air handling units (AHUs), etc. This can be cast as a reinforcement learning problem as follows –

  • The agent will be the controller. The agent learns a linear model of the data center operations with random, safe exploration, with little or no prior knowledge. 
  • The manipulated variables (e.g., fan speed to control airflow, valve openings, etc.) represent the controls or actions
  • The reward is the cost of a trajectory
  • The process variables to predict and regulate (differential air pressure, cold-aisle temperature, entering air temperature to each AHU, leaving air temperature (LAT) from each AHU, etc.) represent the state.
  • The agent optimizes the cost (reward) of a trajectory based on the predicted model and generates actions at each step to mitigate the effect of model error catering for unexpected disturbances. 

For reinforcement learning purposes, the data center is modelled as a control loop for cooling processes. The figure below illustrates the process.

 

   

Finance

I have done a good amount of work in the financial sector and with the same domain knowledge, I think there are multiple problems that can be modelled as sequential decision problems in the financial sector. Reinforcement learning can be employed for some of these, which include problems such as option pricing, portfolio optimization, risk management, etc.  

In case of option pricing, the challenge comes with determining the right price for the option. To formulate option pricing as a reinforcement learning problem, we again define states, actions, and rewards as below –

  • The uncertainties affecting the option price are captured as part of the state. These include financial and economic factors such as interest rates. 
  • The actions could be the act of exercising the option. 
  • The reward is the intrinsic value of the option due to the change in state. 

When we model option pricing as a reinforcement learning problem, the entire training or process depends on learning the state-action-value function.
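To make “learning the state-action-value function” concrete, here is a hedged, toy tabular Q-learning sketch; the state discretization, dynamics and reward below are invented purely for illustration and are not a real option-pricing model.

import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.95, 0.1   # learning rate, discount factor, exploration rate
actions = ['hold', 'exercise']           # toy action space for an option holder
Q = defaultdict(float)                   # Q[(state, action)] -> estimated value

def step(state, action):
    # Invented toy dynamics, not a real pricing model
    if action == 'exercise':
        return None, max(0.0, state - 100.0)      # terminal; reward = intrinsic value
    return state + random.gauss(0.0, 1.0), 0.0    # underlying price drifts randomly

for episode in range(2000):
    state = 100.0
    while state is not None:
        s = round(state)                           # discretize the price into a state
        if random.random() < epsilon:
            a = random.choice(actions)             # explore
        else:
            a = max(actions, key=lambda act: Q[(s, act)])  # exploit current estimates
        next_state, reward = step(state, a)
        if next_state is None:
            target = reward
        else:
            target = reward + gamma * max(Q[(round(next_state), act)] for act in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])  # Q-learning update toward the target
        state = next_state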

Transportation

Reinforcement Learning aims to improve efficiency and reduce cost for its applications in the transportation sector. Order dispatching process in ridesharing systems is one of the best applications of RL in transportation (example – Uber). The process of allocating a driver to a passenger is a complex process and depends on various factors such as demand prediction, route planning, fleet management, etc. The problem of order dispatching includes both spatial and temporal components. This problem could be formulated as a reinforcement learning problem where:

  • The state is composed of a driver’s geographical status, the raw timestamp, and the contextual feature vector (ex: driver service statistics, holiday indicators). 
  • An option represents state change for a driver in multiple time steps. 
  • A policy represents the probability of taking an option in a state. The RL algorithm aims to estimate an optimal policy and/or its value function.

 

     

    

 Composition and workflow of the order dispatching simulator. (from Tang et al. (2019))

The model is initialized using historical data. After that, the process is driven by an order dispatch policy learned with reinforcement learning.

Healthcare

Healthcare is one of the most crucial sectors where there are many opportunities and challenges for AI where reinforcement learning could be used. We will discuss some of these below – 

  • Dynamic treatment strategies (DTRs) – A DTR is a treatment process comprising a sequence of decision rules that determine how the patient’s ongoing treatment should evolve based on the current state and the covariate history (a covariate is any continuous variable that is expected to correlate with the outcome variable). DTRs apply to personalized treatment plans, typically for chronic conditions.

In the case of DTRs, we could consider –

  • The state is composed of a multidimensional discrete-time series composed of variables of interest for the treatment (demographics, vital signs, etc.). We can use clustering to determine the state space such that patients in the same cluster are similar for the observable properties. 
  • An action comprises the medical treatment the patient receives administered as doses of medicine over a sequence of time. 
  • The reward reflects the patient’s health, indicating whether it improves or deteriorates.

Try one for yourself?

Another healthcare application for reinforcement learning is the generation of reports from medical images. A medical report comprises specific segments such as the findings, the report’s conclusion (main finding and diagnosis), any secondary information, etc. For this case, I leave it to the reader to formulate the problem as a reinforcement learning problem.

Hint – In this scenario, first a CNN (convolutional neural network) is used to extract a set of images’ visual features and transform the features into a context vector. From this context vector, a sentence decoder generates latent topics recurrently. Based on a latent topic, a retrieval policy module generates sentences using either a generation approach or a template. The RL based retrieval policy integrates prior human knowledge.

Hope you enjoyed reading the blog! For any questions or doubts, please drop a comment.

About Me (Kajal Singh)

Kajal Singh is a Data Scientist and a Tutor at the  Artificial Intelligence – Cloud and Edge implementations  course at the University of Oxford. She is also the co-author of the book  “Applications of Reinforcement Learning to Real-World Data: An educational introduction to the fundamentals of Reinforcement Learning with practical examples on real data (2021)”.

How Python Saw the First Black Hole

On April 10, 2019, the news broke that scientists of the Event Horizon Telescope collaboration had obtained the image of a black hole for the first time in the history of humanity. In this blog, we will look at how the Python programming language and its various tools contributed to this historic achievement.

However, before jumping into the technical aspects, you should know about the EHT (Event Horizon Telescope). The EHT is, computationally, a telescope the size of the Earth, with an angular resolution of 20 micro-arcseconds – enough to read a magazine in New Delhi from a park in Tokyo.

Challenges faced

The EHT was trained on a supermassive black hole at the center of the M87 galaxy, studied for over 100 years and with a mass 6.5 billion times the Sun’s, but never observed visually before. Working with such a powerful instrument brought the following challenges:

  • Massive Data pre-processing
  • Rapid atmospheric phase fluctuations
  • Large recording bandwidth

The EHT produces about 350 TB of observations per day, which makes reducing the volume and complexity of the data extremely difficult.

Below you can see what the EHT data pre-processing pipeline looks like.

Use of Python and its tools

The diagram below illustrates the role of the scientific Python ecosystem in the analysis of data produced by the EHT.

The Python package eht-imaging (ehtim) uses NumPy, SciPy, Matplotlib, Astropy and scikit-image at the core of its array data pre-processing, and is responsible for performing simulation and image reconstruction on the data. Below is the dependency chart of the eht-imaging package (see the leaf nodes).

Key Python capabilities used

  1. NumPy’s fast, adaptable n-dimensional array helped researchers manipulate large numerical data sets, providing the foundation for the image of a black hole.
  2. Scaling to the vast volume of data – roughly 350 TB per day – involved in EHT imaging.
  3. Handling the complexity of correlating and synchronizing data from telescopes all across the globe.
  4. Speed: fast analysis capabilities to quickly image and manipulate datasets with corrections.

Conclusion

We have all witnessed the increasing popularity of Python over the past several years. From e-commerce to social media, there is little Python cannot do when used properly, and seeing Python used in advanced science to help create history is a real achievement for this beautiful language.

Secret behind the Dimensionality Reduction for Data Scientist


Hello! I would like to share an interesting experience from when I was working as a junior Data Scientist – I can even say I was a beginner in the data science domain at the time.

One of our customers came to us for a machine learning implementation of their problem statement, in both unsupervised and supervised forms. I thought it was going to be the usual mode of execution, because in my experience with small-scale implementations, and during my training period, we used to have 25-30 features to play around with while predicting, classifying or clustering the dataset and sharing the outcome.

But this time they came up with thousands of features. I was a little surprised and scared about the implementation, and my head started spinning. At the same time, my Senior Data Scientist brought everyone from the team into the meeting room.

My Senior Data Scientist (Sr. DS) introduced a new term to us: Dimensionality Reduction (or Dimension Reduction, or the Curse of Dimensionality). All of us beginners thought he was going to explain something in physics, though we vaguely remembered coming across this term during our training programme. Then he started to sketch on the board (refer to the figures below). We were quite comfortable looking at 1-D and 2-D, but at 3-D and above our heads started to spin.

Figure: 1-D and 2-D data
Figure: 3-D data (weather report example)

Sr. DS continued his lecture: these sample pictures show just a handful of features that we can easily play around with. In real-time scenarios, many machine learning (ML) problems involve thousands of features, so the models we train end up extremely slow, will not give good solutions to the business problem, and cannot be frozen. This situation is the so-called “Curse of Dimensionality” at work. Then we all started asking how we should handle this.

He took a long breath and continued to share his experience in his own style. He started with a simple definition, as follows.

 

What is Dimensionality? 

We can say the number of features in our dataset is referred to as its dimensionality.

 

What is Dimensionality Reduction? 

Dimensionality Reduction is the process of reducing the dimensions (features) of a given dataset – say, taking a dataset with a hundred columns/features and bringing the number of columns down to 20-25. In simple terms, it is like converting a cylinder or sphere into a circle, or a cube into a plane, in two-dimensional space, as in the figure below.

Figure: Converting 3-D to 2-D

He then clearly drew the relationship between model performance and the number of features (dimensions). As the number of features increases, the number of data points required also increases proportionally: more features demand more data samples in order to represent all combinations of features and their values.

Figure: Model Performance vs. Number of Features

Now everyone in the room had a high-level feel for what the “Curse of Dimensionality” is.

 

Benefits of doing Dimensionality Reduction

Suddenly, one of the team members asked whether he could tell us the benefits of doing dimensionality reduction on a given dataset.

Our Sr. DS didn’t stop sharing his extensive knowledge. He continued as below.

There are lots of benefits if we go with dimensionality reduction.

  • It helps remove redundancy in the features and noise factors, which ultimately enhances visualization of the given data set.
  • Excellent memory management, thanks to the reduced feature set.
  • Improved model performance, achieved by choosing the right features and removing the unnecessary ones from the dataset.
  • Fewer (only the essential) dimensions mean less computation, so the model trains faster, with improved accuracy.
  • Considerably reduced complexity and overfitting of the overall model, improving its performance.

Yes! It was an awe-inspiring view of the robustness and dynamics of Dimensionality Reduction. Now I can visualize the overall benefits as below; I hope it helps you too.

 

Figure: Benefits of Dimensionality Reduction

What’s next? Of course, we jump into the next major question: what techniques are available for Dimensionality Reduction?

 

Dimensionality Reduction – Techniques

Our Sr. DS, very much interested, continued his explanation of the techniques available in the data science domain. They are broadly classified into two approaches, as mentioned earlier: selecting the best-fit feature(s) or removing the less important features from the given high-dimensional dataset. These high-level techniques are called Feature Selection and Feature Extraction, and they are essentially part of Feature Engineering. He connected the dots perfectly.

Figure: Locating Dimensionality Reduction in the Feature Engineering family

He took us further into the in-depth concepts to understand the big picture of applying Dimensionality Reduction to a high-dimensional dataset. Once we saw the figure below, we were able to relate Feature Engineering and Dimensionality Reduction. Our Sr. DS captured the essence of Dimensionality Reduction well in it!

Figure: Dimensionality Reduction

Everyone was interested to know how to apply all this using Python libraries with simple code. Our Sr. DS asked me to bring colorful markers and dusters.


Sr. DS picked up a new blue marker and started explaining PCA with a simple example, but first he explained what PCA is and how it is used for dimensionality reduction.

Principal Component Analysis (PCA): PCA is a technique for dimensionality reduction of a given dataset that increases interpretability with negligible information loss. The number of variables decreases, which makes further analysis simpler. It converts a set of correlated variables into a set of uncorrelated variables and is used in machine learning predictive modeling. He also advised us to read up on eigenvectors and eigenvalues.

He took the familiar wine quality dataset (winequality.csv) for his quick analysis.

 

PCA

# Import all the necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn import metrics
%matplotlib inline

wq_dataset = pd.read_csv('winequality.csv')

EDA on the given data set:

wq_dataset.head(5)
wq_dataset.describe()
wq_dataset.isnull().any()

There are no null values in the given data set, so we’re lucky.

Find correlations of each feature

correlations = wq_dataset.corr()['quality'].drop('quality')
print(correlations)

Correlation Representation using Heatmap

sns.heatmap(wq_dataset.corr())
plt.show()

 


features = ['fixed acidity', 'volatile acidity', 'citric acid', 'chlorides', 'total sulfur dioxide', 'density', 'sulphates', 'alcohol']
x = wq_dataset[features]
y = wq_dataset['quality']

# Create training and testing set using train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=3)

Training and Testing Shape

print('Training data shape:', x_train.shape)
print('Testing data shape:', x_test.shape)

Training data shape: (1199, 8)
Testing data shape: (400, 8)

PCA implementation for Dimensionality reduction (with 2 columns)

from sklearn.decomposition import PCA

pca_wins = PCA(n_components=2)
principalComponents_wins = pca_wins.fit_transform(x)

Naming them as principal component 1, principal component 2

pcs_wins_df = pd.DataFrame(data = principalComponents_wins, columns = ['principal component 1', 'principal component 2'])

New principal components and their values.

pcs_wins_df.head()

We were all surprised when looking at the two new columns with new names and values, and asked what happened to the ‘fixed acidity’, ‘volatile acidity’, ‘citric acid’, ‘chlorides’, ‘total sulfur dioxide’, ‘density’, ‘sulphates’ and ‘alcohol’ columns. Sr. DS said they were all gone: after applying PCA for dimensionality reduction, we have just two columns, and we are going to build a few models on them in the normal way.

He mentioned one key term, “explained variation per principal component”: the fraction of variance explained by a principal component is the ratio between that component’s variance and the total variance.

print('Explained variation per principal component: {}'.format(pca_wins.explained_variance_ratio_))
Explained variation per principal component: [0.99615166 0.00278501]

Following this, he demonstrated the following models:

  • Logistic Regression
  • Random forest
  • KNN
  • Naive Bayes

Accuracy was good, with little difference between the models, but he noted that this run was based on the PCA implementation. Everyone in the room felt that we had completed an excellent roller coaster, and he advised us to get hands-on with the other Dimensionality Reduction techniques.


Okay, guys! Thanks for your time. I hope I was able to narrate my learning experience with Dimensionality Reduction techniques in the right way here, and I trust it will help you continue the journey of handling complex datasets in machine learning problems. Cheers!

Python Pandas – An in-depth tool for Data Analytics

Hey GEEKS! Here’s my new blog on Pandas library, an in-depth tool for Data science learners.

It’s never too difficult to transform your dataset into something valuable for your project. It requires deep research and study to know which part of the data is interesting to you.

Python enables you to get your hands dirty with data in many ways. One of them, which is my favorite, is its Pandas library. As much as its name excites me, so does its flexible nature of accepting all types of data like JSON, CSV, XLSX, etc., and its numerous handy features like slice/label indexing and time series functions that help me in data analysis.

Here I’ve got some really helpful commands that would help you analyze your data better.

First, we need to import the dataset.

The EDA gets tougher in handling large datasets. Here are some quick DS hacks to help you out!

select_dtypes()

This function enables you to select columns in a data frame with their data types. By this, you can make many useful subsets of data based upon their column dtypes.

Sounds interesting? Let me show you with code
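Since the original code screenshot is not reproduced here, this is a minimal sketch with an invented, insurance-style frame (the column names are assumptions):

import pandas as pd

df = pd.DataFrame({
    'car': ['Honda 4425', 'Suzuki 9801'],   # invented sample rows
    'age': [31, 45],
    'insurance': ['yes', 'no'],
    'premium': [15000.0, 9500.0],
})

numeric_cols = df.select_dtypes(include='number')   # only int/float columns
text_cols = df.select_dtypes(include='object')      # only string columns
print(numeric_cols.columns.tolist(), text_cols.columns.tolist())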

melt()

The melt() function is used to transform a data frame from wide to a long format. It can take multiple columns as identifiers of the table and allows you to see your data with two non-identifier columns (variable and value) to observe the measured variables.


Identifiers (car & age) describe the non-identifier values (whether or not the insurance is taken).
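A hedged sketch of the same idea, using an invented frame with car and age as the identifiers:

import pandas as pd

df = pd.DataFrame({
    'car': ['Honda 4425', 'Suzuki 9801'],   # invented identifiers
    'age': [31, 45],
    'insurance': ['yes', 'no'],
})

# Wide to long: non-identifier columns are unpivoted into variable/value pairs
long_df = pd.melt(df, id_vars=['car', 'age'], value_vars=['insurance'])
print(long_df)   # columns: car, age, variable, value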

columns.tolist()

Another common practice in data munging and analysis is to convert the columns object to a list. Since it’s easier to modify a list than an Index object, this method helps us extract only those columns which have a great impact on our target column, or change the order of the columns in a data frame.


car_loan column becomes the 1st column now
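A minimal, hedged reconstruction of that reordering step with invented columns:

import pandas as pd

df = pd.DataFrame({
    'age': [31, 45],
    'income': [52000, 61000],
    'car_loan': ['yes', 'no'],   # invented target-like column
})

cols = df.columns.tolist()                           # Index object -> plain Python list
cols.insert(0, cols.pop(cols.index('car_loan')))     # move car_loan to the front
df = df[cols]
print(df.columns.tolist())                           # ['car_loan', 'age', 'income']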

str.split()

With the help of Pandas, we can split a single string into multiple columns with str.split() method. This means it will return a data frame with all separated strings in different columns. You can easily do this by specifying the separator value in the function. When a separator isn’t written, whitespace is taken as an input.


Column Car splits into two columns vehicle and car_number
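A hedged sketch of that split, with invented values in the car column:

import pandas as pd

df = pd.DataFrame({'car': ['Honda 4425', 'Suzuki 9801']})

# Split on the space separator into two new columns
df[['vehicle', 'car_number']] = df['car'].str.split(' ', expand=True)
print(df)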

to_datetime()

Handling datetime features in data science is also very important as it requires correct formatting and values to train the model.

This function takes datetime argument as an input parameter and converts it into a python datetime object.
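A minimal, hedged example with invented timestamp strings:

import pandas as pd

df = pd.DataFrame({'call_start': ['2021-06-01 09:15:00', '2021-06-01 10:02:30']})

df['call_start'] = pd.to_datetime(df['call_start'])   # strings -> datetime64 values
print(df.dtypes)
print(df['call_start'].dt.hour)                        # datetime accessors now available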

Lambda Functions:

They are nameless (anonymous) functions that can take multiple arguments (bound variables) but return only one expression (the body part). These functions execute faster than normal functions.


We apply a lambda function with the help of the apply() function.

In the above example, we have merged the call_start and call_end columns into a single column, call_duration, and then applied a lambda function to convert the resulting timedelta value into an integer number of seconds.
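A hedged reconstruction of that call_duration step with invented timestamps; the column names mirror the ones mentioned above:

import pandas as pd

df = pd.DataFrame({
    'call_start': pd.to_datetime(['2021-06-01 09:15:00', '2021-06-01 10:02:30']),
    'call_end': pd.to_datetime(['2021-06-01 09:19:45', '2021-06-01 10:03:10']),
})

# Lambda applied row-wise: timedelta between end and start, converted to whole seconds
df['call_duration'] = df.apply(
    lambda row: int((row['call_end'] - row['call_start']).total_seconds()), axis=1)
print(df['call_duration'])   # 285 and 40 seconds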

Notebook Link : Data-science/DSHacks.ipynb at main · ToobaAhmedAlvi/Data-science (github.com)

I’ll come with some more interesting hacks. Until then stay connected. 🙂
