Understanding The Role of Artificial Intelligence in the Banking Sector

In an era of rapid technological development, Artificial Intelligence (AI) has taken center stage.

From startups to multinational corporations, organizations are making AI an integral part of their business at scale.

According to a report published by Adobe, the share of jobs requiring AI has increased by 450% since 2013.

In the digital era, with every industry being disrupted by technological advancement, it is increasingly crucial that consumers get a seamless banking experience.

The growth in the working population and the increase in disposable income are acting as catalysts for the demand for digital banking. In this blog, you will learn how Artificial Intelligence is helping the banking sector.

Key Takeaways 

  • The growth in the working population and the increase in disposable income are acting as catalysts for the demand for digital banking.
  • 54% of financial services organizations with more than 5,000 employees have adopted AI.
  • The advantages of adopting AI in the banking sector include Risk Management, Fraud Detection, an enhanced Customer Service Experience, Quick Resolution Time and Digitization.

 

Adoption of AI in the Banking sector

Artificial Intelligence has taken the banking industry by storm with its offerings and services. Not long ago your nearest vegetable vendor was accepting only cash as a mode of payment. Now you will notice QR scanners for online transactions at almost every shop you visit, which tells us that technology adoption is taking place at a rapid pace due to high internet penetration.

India recorded an internet penetration of 41% this year. According to a report from 2016, many banks have collaborated with fintech startups to provide innovative solutions and a seamless banking experience to their customers.

According to a report published by the Economist Intelligence Unit, 54% of financial services organizations with more than 5,000 employees have adopted AI. Even though AI presents a huge opportunity for the banking sector to step up its game, the sector is still taking baby steps.

 

How is AI Revolutionizing the Banking industry of India?

Artificial intelligence is changing the face of the Indian banking sector, with top banks such as Bank of Baroda, State Bank of India, Axis, HDFC and others adopting AI in different domains of their financial infrastructure. The Indian banking sector is one of the largest, and it is well capitalized and regulated.

As per a report published in 2020, the Indian banking sector has a total of 22 private sector banks, 56 regional banks, 46 foreign banks, 96,000 rural cooperative banks and 1,485 urban cooperative banks. A majority of the data is generated during back-office banking operations.

Back-office operations refer to the process of managing huge volumes of data and customer databases in order to gain insights that help financial institutions function smoothly.

AI is playing a massive role in this domain by assessing the creditworthiness of borrowers, detecting fraud and money laundering, and even enhancing relationships with customers.

 

Types of AI being utilized in the Banking sector

Although AI is sometimes divided into as many as 7 types, it can be broadly classified into 2 types, namely:

  1. Weak AI
  2. Strong AI 

Weak AI refers to the type of Artificial Intelligence which is primarily used to perform a task to solve a specific problem. These intelligent systems are fed with pre-defined sets of functions to perform a particular task smartly.

Therefore, they are termed Weak AI or Narrow AI.

Strong AI, on the other hand, refers to the type of Artificial Intelligence that, to put it in simple words, aims to broadly mimic the human brain.

This means that it is designed to carry out any task that an actual human being can.

 

Advantages of AI in the Banking sector

So far we have understood the adoption of Artificial intelligence in the Banking sector. Now, let’s understand the various benefits AI can offer to this sector in this section.

  • Risk Management: Risk management is one of the most crucial parts of banking. Credit risk refers to the possibility that a borrower fails to repay a loan or meet other contractual obligations within the stipulated time frame. AI can help here by analyzing mobile banking activity to understand how a user handles their money, which in turn helps the bank assess the risk of sanctioning a loan to that person.
  • Fraud Detection: Every year we are bombarded with news about scams and other financial frauds worth crores of Rupees. In 2020 alone, 47% of companies witnessed fraud. With technological advancements, scammers are also finding new ways to fool the system. To reduce the rate of fraud, banks have started using ML-driven fraud analytics. Machine Learning works on the concept of “learning from experience”: by using ML algorithms, machines can be trained to tell legitimate transactions from fraudulent ones, which ultimately helps in flagging abnormal activity before it causes damage.
  • Seamless Customer Experience: AI in banking offers a seamless customer experience through personal finance features such as chatbots, subscription services and customized notifications. With the help of chatbots, customers can get answers to queries about past transactions, spending analysis, savings and other finance-related information even when the bank is closed or on national holidays. This instant access to information not only saves time but also helps customers better understand the products and services the bank provides.
  • Digitization of the process: With the expansion of internet penetration across the country, digitization of the banking process is becoming a must. One of the biggest problems it solves is the hassle of standing in long queues to get bank work done, since most procedures become “paperless”. One of the most popular initiatives is the Know Your Customer (KYC) registration process, which is now completely online.
  • Quick Resolution Time: AI-powered chatbots are helping the banking industry in a massive way by reducing operational costs and enhancing the customer experience with quick resolution times. Because AI chatbots also work on the concept of “learning from experience”, they help reduce human errors by a large margin.
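
As a rough illustration of “learning from experience” in fraud detection, the sketch below trains a tiny nearest-centroid classifier on a handful of labelled transactions and then labels new ones. The features (amount and hour of day), the sample data and the function names are all hypothetical and chosen only for illustration; real fraud analytics use far richer features and models.

```python
from statistics import mean

# Hypothetical labelled history: (amount in rupees, hour of day, label).
HISTORY = [
    (500, 10, "legit"), (1200, 14, "legit"), (800, 18, "legit"), (300, 12, "legit"),
    (95000, 2, "fraud"), (120000, 3, "fraud"), (88000, 1, "fraud"),
]

def train_centroids(history):
    """Average the features of each class: the 'experience' the model learns from."""
    centroids = {}
    for label in {row[2] for row in history}:
        rows = [row for row in history if row[2] == label]
        centroids[label] = (mean(r[0] for r in rows), mean(r[1] for r in rows))
    return centroids

def classify(centroids, amount, hour, amount_scale=1000.0):
    """Return the label whose centroid is closest to the transaction."""
    def distance(c):
        # Scale the amount so that it is comparable with the hour feature.
        return ((amount - c[0]) / amount_scale) ** 2 + (hour - c[1]) ** 2
    return min(centroids, key=lambda label: distance(centroids[label]))

centroids = train_centroids(HISTORY)
print(classify(centroids, 700, 11))     # small daytime payment
print(classify(centroids, 101000, 2))   # large payment in the middle of the night
```

The more labelled history the model sees, the better its picture of each class becomes, which is the sense in which it learns from experience.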

 

Wrapping Up 

In the digital era, “data” is the most important form of currency. With so much data available everywhere, it becomes more important for organizations to use it effectively and efficiently.

The technology that is currently revolutionizing the fintech industry is none other than Artificial Intelligence. AI is providing financial institutions with a platform that helps people, data and services function coherently.

The role of AI in banking continues to gain prominence, and global spending on AI is predicted to touch $300 billion by 2030. A report published by IBEF estimates that India’s digital lending would reach US$ 1 trillion by 2023, driven by a five-fold increase in digital reimbursements.

Source Prolead brokers usa

How Can SEO Testing Increase Traffic and Profits?

Quality is the most important criterion by which Google ranks websites in SERP. The more convenient, useful, and interesting a web application is, the more users turn to it. The growing number of visitors makes it clear for the search engine that the website has value and should be raised higher so that it can be viewed by as many people as possible. Being in the TOP 10 in the search results is a paramount task, as statistically, only 0.78% of users reach the second page of Google. Therefore, companies should give consideration to not only an SEO audit but also SEO testing because it directly affects the traffic and profits of an organization. Let’s figure out how it works.

What SEO testing is and why it is important

SEO testing is not much different from other types of testing. It involves searching for errors: 404 pages, broken links, irrelevant code, problems with loading pages, visual defects, and so on. All of these can also be found during UX testing, performance testing, or cross-platform testing. What sets SEO testing apart is its goal: to identify problems that arise after changes to a website before they affect the quality of the web application and its organic traffic.

From a tester’s point of view, the 404 error is not a significant bug: it doesn’t cause disruption of the website, and the visitor can go to another page and continue the search. But from the SEO perspective, hundreds of 404 pages can significantly reduce organic traffic. According to Google, 61% of users will leave a website if it has access issues. The search engine sees a lot of low-quality pages with duplicate content and lowers the “intruder” in the ranking.

Here’s another example of the direct impact of errors on traffic. Suppose developers have updated an important page and deleted its H1 heading by accident. H1 is the heading of the page that users see, and it is one of the key ranking factors the search engine uses to determine what the page is about. If H1 is deleted, this important page can drop out of the search results, which leads to a drop in traffic.
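
A check for this kind of regression is easy to automate. The following is a minimal sketch using only Python's standard library; the class name, the function name and the two sample pages are made up for illustration, and a real check would fetch the pages from the site under test.

```python
from html.parser import HTMLParser

class H1Collector(HTMLParser):
    """Collects the text of every <h1> element in a page."""
    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headings[-1] += data

def find_h1_headings(html):
    """Return the non-empty H1 texts found in the given HTML string."""
    parser = H1Collector()
    parser.feed(html)
    return [h.strip() for h in parser.headings if h.strip()]

# Hypothetical page before and after an update that dropped the heading.
page_before = "<html><body><h1>Red running shoes</h1><p>...</p></body></html>"
page_after = "<html><body><p>...</p></body></html>"

for name, page in (("before", page_before), ("after", page_after)):
    headings = find_h1_headings(page)
    print(name, "OK" if headings else "MISSING H1", headings)
```

Run after each deployment, a check like this flags the missing heading before the search engine re-indexes the page.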

From these examples, it becomes clear that any change can lead to the accidental loss of data important for SEO, which will affect the quality of the website and its ranking in the search engine. That’s why testers should devote their attention to SEO. This will allow a business to maintain an unbreakable chain:

  1. The higher the quality of the website that QA provides is, the higher the web application is ranked.
  2. The higher the search engine score is, the more organic traffic comes to the website.
  3. The higher the organic traffic is, the more orders for products or paid services the business receives.
  4. The more leads the business has, the better.

No website is immune to problems with SEO. Google has over 200 ranking factors, and the search engine changes its algorithms up to 1000 times a year. You needn’t strive to please it. What’s important is to create a high-quality, user-friendly website and to check it after each change in order to secure top positions in the search results.

An SEO audit and SEO testing: what’s the difference?

Sometimes non-experts get confused about the difference between an SEO audit and SEO testing. The SEO audit is an analysis of the current state of the website manually or using a special tool. This procedure helps to understand whether the pages are indexed, whether meta tags are written for them, whether the images are optimized, etc. In other words, such research helps to identify content gaps or deficiencies in the information architecture.

SEO testing involves tracking the results after changes to assess their impact or effectiveness. A well-tuned QA process for SEO includes:

  • Benchmark testing, where two versions of the source code are compared (staging and production);
  • Testing elements important for SEO (for example, metadata);
  • Automation (using tools that collect all changes between staging and production);
  • Monitoring changes when the application is in production;
  • An archive of web pages, which contains a history of changes and source code you can return to in the case of a traffic drop.

QA testing helps to identify problems with a website before a product hits the market. The practice of good QA for SEO works as a safety net, eliminating potential problems and reducing the number of bug fixes.
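
The comparison of two versions of a page can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the two versions are called staging and production here, the tracked fields (title, meta description, H1 count) are an arbitrary selection, and all class and function names are hypothetical.

```python
from html.parser import HTMLParser

class SeoSnapshot(HTMLParser):
    """Records a few SEO-relevant elements: <title>, meta description, H1 count."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.fields = {"title": "", "description": None, "h1_count": 0}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "h1":
            self.fields["h1_count"] += 1
        elif tag == "meta" and attrs.get("name") == "description":
            self.fields["description"] = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] += data

def snapshot(html):
    """Parse one page and return its SEO fields as a dict."""
    parser = SeoSnapshot()
    parser.feed(html)
    return parser.fields

def seo_diff(staging_html, production_html):
    """Return {field: (production_value, staging_value)} for fields that differ."""
    prod, stage = snapshot(production_html), snapshot(staging_html)
    return {k: (prod[k], stage[k]) for k in prod if prod[k] != stage[k]}

# Hypothetical pages: the staging build has lost its meta description.
production = ('<head><title>Shoes</title>'
              '<meta name="description" content="Buy shoes"></head>'
              '<body><h1>Shoes</h1></body>')
staging = '<head><title>Shoes</title></head><body><h1>Shoes</h1></body>'

print(seo_diff(staging, production))
```

An empty result means the tracked elements are unchanged; any entry in the result is a change that should be reviewed before release.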

How often should SEO testing be done?

SEO testing is worth doing whenever your website is updated. As QA practitioners note, identifying SEO bugs is quite difficult because they may not affect the overall functionality of the website. It also takes time for search engines to re-index the website after bugs are found and fixed, so if an error is not eliminated promptly, the website can drop out of the first search results.

Fortunately, this has become much easier, since experts have access to tools that automate data collection and identify the quality problems that need to be solved. This makes testers’ jobs much easier, as they no longer need to manually check every page, link, or image.

SEO testing helps to prevent failed migrations, fraudulent redirects, unintentional indexing, disappearing tags, and more. It allows specialists to check important elements for ranking: broken links, missing content, page load speed, missing metadata, and other issues that affect SEO performance.

With so many hands working on a website (developers, designers, project managers, and so on), every new update poses a risk. Since these updates directly affect the sales and success of a business, SEO testing should be an important part of increasing organic search traffic on Google. Therefore, as part of SEO promotion, it is worth tapping into the services of QA specialists who focus on finding defects and improving the quality of software.

Machine learning with H2O in R / Python

In this blog, we shall discuss how to use H2O to build a few supervised machine learning models. H2O is Java-based software for data modeling and general computing; its primary purpose is to serve as a distributed, parallel, in-memory processing engine. It needs to be installed first (instructions), and by default an H2O instance will run on localhost:54321. Additionally, one needs to install the R / Python clients to communicate with the H2O instance. Every new R / Python session first needs to initialize a connection between the client and the H2O cluster.

The problems described in this blog appeared in the exercises / projects of the Coursera course “Practical Machine Learning on H2O,” by H2O. The problem statements / descriptions / steps are taken from the course itself. We shall use the concepts from the course in order to:

  • build a few machine learning / deep learning models using different algorithms (such as Gradient Boosting, Random Forest, Neural Net, Elastic Net GLM etc.),
  • review the classic bias-variance tradeoff (overfitting),
  • tune hyper-parameters using Grid Search,
  • use AutoML to automatically find a bunch of good performing models,
  • use Stacked Ensembles of models to improve performance.

Problem 1

In this problem we will create an artificial data set, then run random forest / GBM on it with H2O to create two supervised models, one that is reasonable and another one that shows clear over-fitting. We will use the R client (package) for H2O for this problem.

  1. Let’s first create a data set to predict an employee’s job satisfaction in an organization. Let’s say an employee’s job satisfaction depends on the following factors (there are several other factors in general, but we shall limit ourselves to the following few):
    • work environment
    • pay
    • flexibility
    • relationship with manager
    • age
set.seed(321)

# Let's say an employee's job satisfaction depends on the work environment,
# pay, flexibility, relationship with manager and age.
N <- 1000 # number of samples
d <- data.frame(id = 1:N)
d$workEnvironment <- sample(1:5, N, replace=TRUE) # on a scale of 1-5, 1 being bad and 5 being good
v <- round(rnorm(N, mean=60000, sd=20000)) # 68% are 40-80k
v <- pmax(v, 20000)
v <- pmin(v, 100000)
#table(v)
d$pay <- v
d$flexibility <- sample(1:5, N, replace=TRUE) # on a scale of 1-5, 1 being bad and 5 being good
d$managerRel <- sample(1:5, N, replace=TRUE) # on a scale of 1-5, 1 being bad and 5 being good
d$age <- round(runif(N, min=20, max=60))
head(d)
#   id workEnvironment   pay flexibility managerRel age
# 1  1               2 20000           2          2  21
# 2  2               5 75817           1          2  31
# 3  3               5 45649           5          3  25
# 4  4               1 47157           1          5  55
# 5  5               2 69729           2          4  33
# 6  6               1 75101           2          2  39

v <- 125 * (d$pay/1000)^2 # e.g., job satisfaction score is proportional to square of pay (hypothetically)
v <- v + 250 / log(d$age) # e.g., inversely proportional to log of age
v <- v + 5 * d$flexibility
v <- v + 200 * d$workEnvironment
v <- v + 1000 * d$managerRel^3
v <- v + runif(N, 0, 5000)
v <- 100 * (v - 0) / (max(v) - min(v)) # scale by the range (max - min) to bring the score to roughly 0-100
d$jobSatScore <- round(v) # Round to nearest integer (percentage)
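
Written out as a formula, the synthetic score built by the code above is the following, where $u_i$ denotes the uniform noise term (the symbols are introduced here only for readability and do not appear in the original code):

```latex
v_i = 125\left(\frac{\mathrm{pay}_i}{1000}\right)^{2}
    + \frac{250}{\ln(\mathrm{age}_i)}
    + 5\,\mathrm{flexibility}_i
    + 200\,\mathrm{workEnvironment}_i
    + 1000\,\mathrm{managerRel}_i^{3}
    + u_i,
\qquad u_i \sim \mathcal{U}(0,\,5000)

\mathrm{jobSatScore}_i = \operatorname{round}\!\left(\frac{100\,v_i}{\max_j v_j - \min_j v_j}\right)
```

Because the minimum is not subtracted in the numerator, this is a rescaling by the range rather than a strict min-max normalization, so the scores are only approximately confined to 0-100.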

2. Let’s start h2o, and import the data.

library(h2o)
h2o.init()
as.h2o(d, destination_frame = "jobsatisfaction")
jobsat <- h2o.getFrame("jobsatisfaction")
# |===========================================================================================================| 100%
#   id workEnvironment   pay flexibility managerRel age jobSatScore
# 1  1               2 20000           2          2  21           5
# 2  2               5 75817           1          2  31          55
# 3  3               5 45649           5          3  25          22
# 4  4               1 47157           1          5  55          30
# 5  5               2 69729           2          4  33          51
# 6  6               1 75101           2          2  39          54

3. Let’s split the data. Here we plan to use cross-validation.

parts <- h2o.splitFrame(
  jobsat,
  ratios = 0.8,
  destination_frames = c("jobsat_train", "jobsat_test"),
  seed = 321)
train <- h2o.getFrame("jobsat_train")
test <- h2o.getFrame("jobsat_test")
nrow(train) # 794 rows
nrow(test) # 206 rows
y <- "jobSatScore"
x <- setdiff(names(train), c("id", y))
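
The row counts are 794 / 206 rather than exactly 800 / 200 because H2O's frame splitting is probabilistic rather than exact. The idea can be sketched in plain Python as below. This is not H2O's actual implementation: the function name and the use of Python's random module are assumptions made for illustration, and the resulting counts will differ from H2O's.

```python
import random

def probabilistic_split(n_rows, ratio, seed):
    """Assign each row to train with probability `ratio`, otherwise to test."""
    rng = random.Random(seed)
    train_ids, test_ids = [], []
    for row_id in range(n_rows):
        (train_ids if rng.random() < ratio else test_ids).append(row_id)
    return train_ids, test_ids

train_ids, test_ids = probabilistic_split(1000, 0.8, seed=321)
# The sizes land close to, but usually not exactly on, 800 / 200.
print(len(train_ids), len(test_ids))
```

Fixing the seed makes the split reproducible, which is why the same seed is passed to the split in the code above.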

4. Let’s choose the gradient boosting model (gbm), and create a model. It’s a regression model since the output variable is treated as continuous.

# the reasonable model with 10-fold cross-validation
m_res <- h2o.gbm(x, y, train,
                 model_id = "model10foldsreasonable",
                 ntrees = 20, nfolds = 10, seed = 123)

h2o.performance(m_res, train = TRUE) # RMSE 2.840688
# H2ORegressionMetrics: gbm
# ** Reported on training data. **
# MSE: 8.069509
# RMSE: 2.840688
# MAE: 2.266134
# RMSLE: 0.1357181
# Mean Residual Deviance : 8.069509

h2o.performance(m_res, xval = TRUE) # RMSE 2.973807
# H2ORegressionMetrics: gbm
# ** Reported on cross-validation data. **
# ** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
# MSE: 8.84353
# RMSE: 2.973807
# MAE: 2.320899
# RMSLE: 0.1384746
# Mean Residual Deviance : 8.84353

h2o.performance(m_res, test) # RMSE 3.299601
# H2ORegressionMetrics: gbm
# MSE: 10.88737
# RMSE: 3.299601
# MAE: 2.524492
# RMSLE: 0.1409274
# Mean Residual Deviance : 10.88737

5. Let’s try some alternative parameters, to build a different model, and show how the results differ.

# overfitting model with 10-fold cross-validation
m_ovf <- h2o.gbm(x, y, train,
                 model_id = "model10foldsoverfitting",
                 ntrees = 2000, max_depth = 20, nfolds = 10, seed = 123)

h2o.performance(m_ovf, train = TRUE) # RMSE 0.004474786
# H2ORegressionMetrics: gbm
# ** Reported on training data. **
# MSE: 2.002371e-05
# RMSE: 0.004474786
# MAE: 0.0007455944
# RMSLE: 5.032019e-05
# Mean Residual Deviance : 2.002371e-05

h2o.performance(m_ovf, xval = TRUE) # RMSE 0.6801615
# H2ORegressionMetrics: gbm
# ** Reported on cross-validation data. **
# ** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
# MSE: 0.4626197
# RMSE: 0.6801615
# MAE: 0.4820542
# RMSLE: 0.02323415
# Mean Residual Deviance : 0.4626197

h2o.performance(m_ovf, test) # RMSE 0.4969761
# H2ORegressionMetrics: gbm
# MSE: 0.2469853
# RMSE: 0.4969761
# MAE: 0.3749822
# RMSLE: 0.01698435
# Mean Residual Deviance : 0.2469853
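
One way to read the two sets of metrics is to compare each model's cross-validated RMSE with its training RMSE. The short Python sketch below does this with the RMSE values reported in the outputs of steps 4 and 5; the ratio is just an illustrative diagnostic introduced here, not a metric computed by H2O.

```python
# RMSE values copied from the h2o.performance() outputs of the two models.
rmse = {
    "reasonable":  {"train": 2.840688,    "xval": 2.973807,  "test": 3.299601},
    "overfitting": {"train": 0.004474786, "xval": 0.6801615, "test": 0.4969761},
}

def generalization_ratio(scores):
    """How many times larger the cross-validated RMSE is than the training RMSE."""
    return scores["xval"] / scores["train"]

for name, scores in rmse.items():
    print(f"{name}: xval/train RMSE ratio = {generalization_ratio(scores):.2f}")
```

For the first model the ratio is close to 1, so it performs about as well on held-out folds as on the data it was trained on. For the second model the ratio is well above 100: its training error is almost zero because it has memorized the training rows, which is the signature of over-fitting even though its absolute errors on this synthetic data set are lower.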

Problem 2

Predict Chocolate Makers Location with Deep Learning Model with H2O

The data is available here: http://coursera.h2o.ai/cacao.882.csv

This is a classification problem. We need to predict “Maker Location.” In other words, using the rating and the other fields, how accurately can we identify whether it is Belgian chocolate, French chocolate, and so on? We shall use the Python client (library) for H2O for this problem.

  1. Let’s start H2O, load the data set, and split it. By the end of this stage we should have
    three variables, pointing to three data frames on H2O: train, valid, test. However, if you are choosing to use
    cross-validation, you will only have two: train and test.
import h2o
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('http://coursera.h2o.ai/cacao.882.csv')
print(df.shape) # (1795, 9)
df.head()

   Maker     Origin       REF   Review Date  Cocoa Percent  Maker Location  Rating  Bean Type  Bean Origin
0  A. Morin  Agua Grande  1876  2016         63%            France          3.75               Sao Tome
1  A. Morin  Kpime        1676  2015         70%            France          2.75               Togo
2  A. Morin  Atsane       1676  2015         70%            France          3.00               Togo
3  A. Morin  Akata        1680  2015         70%            France          3.50               Togo
4  A. Morin  Quilla       1704  2015         70%            France          3.50               Peru
print(df['Maker Location'].unique())
# ['France' 'U.S.A.' 'Fiji' 'Ecuador' 'Mexico' 'Switzerland' 'Netherlands'
# 'Spain' 'Peru' 'Canada' 'Italy' 'Brazil' 'U.K.' 'Australia' 'Wales'
# 'Belgium' 'Germany' 'Russia' 'Puerto Rico' 'Venezuela' 'Colombia' 'Japan'
# 'New Zealand' 'Costa Rica' 'South Korea' 'Amsterdam' 'Scotland'
# 'Martinique' 'Sao Tome' 'Argentina' 'Guatemala' 'South Africa' 'Bolivia'
# 'St. Lucia' 'Portugal' 'Singapore' 'Denmark' 'Vietnam' 'Grenada' 'Israel'
# 'India' 'Czech Republic' 'Domincan Republic' 'Finland' 'Madagascar'
# 'Philippines' 'Sweden' 'Poland' 'Austria' 'Honduras' 'Nicaragua'
# 'Lithuania' 'Niacragua' 'Chile' 'Ghana' 'Iceland' 'Eucador' 'Hungary'
# 'Suriname' 'Ireland']
print(len(df['Maker Location'].unique())) # 60

loc_table = df['Maker Location'].value_counts()
print(loc_table)
# U.S.A.               764
# France               156
# Canada               125
# U.K.                  96
# Italy                 63
# Ecuador               54
# Australia             49
# Belgium               40
# Switzerland           38
# Germany               35
# Austria               26
# Spain                 25
# Colombia              23
# Hungary               22
# Venezuela             20
# Madagascar            17
# Japan                 17
# New Zealand           17
# Brazil                17
# Peru                  17
# Denmark               15
# Vietnam               11
# Scotland              10
# Guatemala             10
# Costa Rica             9
# Israel                 9
# Argentina              9
# Poland                 8
# Honduras               6
# Lithuania              6
# Sweden                 5
# Nicaragua              5
# Domincan Republic      5
# South Korea            5
# Netherlands            4
# Amsterdam              4
# Puerto Rico            4
# Fiji                   4
# Sao Tome               4
# Mexico                 4
# Ireland                4
# Portugal               3
# Singapore              3
# Iceland                3
# South Africa           3
# Grenada                3
# Chile                  2
# St. Lucia              2
# Bolivia                2
# Finland                2
# Martinique             1
# Eucador                1
# Wales                  1
# Czech Republic         1
# Suriname               1
# Ghana                  1
# India                  1
# Niacragua              1
# Philippines            1
# Russia                 1
# Name: Maker Location, dtype: int64
loc_table.hist()

As can be seen from the above table, some of the locations have too few records, which would result in poor accuracy for a model learnt after splitting the dataset into train, validation and test datasets. Let’s get rid of the locations that have a small number of (< 40) examples in the dataset. Reducing the number of categories in the output variable also makes the results easier to comprehend.

## keep only the countries with at least 40 examples in the dataset
loc_gt_40_recs = loc_table[loc_table >= 40].index.tolist()
df_sub = df[df['Maker Location'].isin(loc_gt_40_recs)]

# now connect to H2O
h2o.init() # h2o.clusterStatus()
H2O cluster uptime: 1 day 14 hours 48 mins
H2O cluster version: 3.13.0.3978
H2O cluster version age: 4 years and 9 days !!!
H2O cluster name: H2O_started_from_R_Sandipan.Dey_kpl973
H2O cluster total nodes: 1
H2O cluster free memory: 2.530 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster status: locked, healthy
H2O connection url: http://localhost:54321
H2O connection proxy: None
H2O internal security: False
H2O API Extensions: Algos, AutoML, Core V3, Core V4
Python version: 3.7.6 final
h2o_df = h2o.H2OFrame(df_sub.values, destination_frame = "cacao_882",
                      column_names=[x.replace(' ', '_') for x in df.columns.tolist()])
#h2o_df.head()
#h2o_df.summary()
df_cacao_882 = h2o.get_frame('cacao_882') # df_cacao_882.as_data_frame()
#df_cacao_882.head()
df_cacao_882.describe()

         Maker     Origin          REF                 Review_Date        Cocoa_Percent  Maker_Location  Rating              Bean_Type  Bean_Origin
type     enum      enum            int                 int                enum           enum            real                enum       enum
mins                               5.0                 2006.0                                            1.0
mean                               1025.8849294729039  2012.273942093541                                 3.1818856718633928
maxs                               1952.0              2017.0                                            5.0
sigma                              553.7812013716441   2.978615633185091                                 0.4911459825968248
zeros                              0                   0                                                 0
missing  0         0               0                   0                  0              0               0                   0          0
0        A. Morin  Agua Grande     1876.0              2016.0             63%            France          3.75                           Sao Tome
1        A. Morin  Kpime           1676.0              2015.0             70%            France          2.75                           Togo
2        A. Morin  Atsane          1676.0              2015.0             70%            France          3.0                            Togo
3        A. Morin  Akata           1680.0              2015.0             70%            France          3.5                            Togo
4        A. Morin  Quilla          1704.0              2015.0             70%            France          3.5                            Peru
5        A. Morin  Carenero        1315.0              2014.0             70%            France          2.75                Criollo    Venezuela
6        A. Morin  Cuba            1315.0              2014.0             70%            France          3.5                            Cuba
7        A. Morin  Sur del Lago    1315.0              2014.0             70%            France          3.5                 Criollo    Venezuela
8        A. Morin  Puerto Cabello  1319.0              2014.0             70%            France          3.75                Criollo    Venezuela
9        A. Morin  Pablino         1319.0              2014.0             70%            France          4.0                            Peru
df_cacao_882['Maker_Location'].table()
# Maker_Location  Count
# Australia          49
# Belgium            40
# Canada            125
# Ecuador            54
# France            156
# Italy              63
# U.K.               96
# U.S.A.            764

train, valid, test = df_cacao_882.split_frame(ratios = [0.8, 0.1],
                                              destination_frames = ['train', 'valid', 'test'],
                                              seed = 321)
print("%d/%d/%d" %(train.nrows, valid.nrows, test.nrows))
# 1082/138/127

2. Let’s set x to be the list of columns we shall use to train on, and y to be the column we shall learn. Here it’s going to be a multi-class classification problem.

ignore_fields = ['Review_Date', 'Bean_Type', 'Maker_Location']
# Specify the response and predictor columns
y = 'Maker_Location' # multinomial Classification
x = [i for i in train.names if i not in ignore_fields]

3. Let’s now create a baseline deep learning model. It is recommended to use all default settings (remembering to specify either nfolds or validation_frame) for the baseline model.

from h2o.estimators.deeplearning import H2ODeepLearningEstimator

model = H2ODeepLearningEstimator()
%time model.train(x = x, y = y, training_frame = train, validation_frame = valid)
# deeplearning Model Build progress: |██████████████████████████████████████| 100%
# Wall time: 6.44 s

model.model_performance(train).mean_per_class_error() # 0.05118279569892473
model.model_performance(valid).mean_per_class_error() # 0.26888404593884047
perf_test = model.model_performance(test)
print('Mean class error', perf_test.mean_per_class_error()) # Mean class error 0.2149184149184149
print('log loss', perf_test.logloss()) # log loss 0.48864148412056846
print('MSE', perf_test.mse()) # MSE 0.11940531127368789
print('RMSE', perf_test.rmse()) # RMSE 0.3455507361787671
perf_test.hit_ratio_table()
Top-8 Hit Ratios: 
k hit_ratio
1 0.8897638
2 0.9291338
3 0.9527559
4 0.9685039
5 0.9763779
6 0.9921259
7 0.9999999
8 0.9999999
perf_test.confusion_matrix().as_data_frame()

   Australia  Belgium  Canada  Ecuador  France  Italy  U.K.  U.S.A.  Error     Rate
0  3.0        0.0      0.0     0.0      0.0     0.0    0.0   2.0     0.400000  2 / 5
1  0.0        2.0      0.0     0.0      0.0     1.0    0.0   0.0     0.333333  1 / 3
2  0.0        0.0      12.0    0.0      0.0     0.0    0.0   1.0     0.076923  1 / 13
3  0.0        0.0      0.0     3.0      0.0     0.0    0.0   0.0     0.000000  0 / 3
4  0.0        0.0      0.0     0.0      8.0     2.0    0.0   1.0     0.272727  3 / 11
5  0.0        0.0      0.0     0.0      0.0     10.0   0.0   0.0     0.000000  0 / 10
6  0.0        0.0      0.0     1.0      0.0     2.0    4.0   4.0     0.636364  7 / 11
7  0.0        0.0      0.0     0.0      0.0     0.0    0.0   71.0    0.000000  0 / 71
8  3.0        2.0      12.0    4.0      8.0     15.0   4.0   79.0    0.110236  14 / 127

model.plot()
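
The reported mean per-class error and the top-1 hit ratio can both be recomputed by hand from the confusion matrix, which is a useful sanity check on what the metrics mean. The Python sketch below copies the eight class rows of the baseline model's test-set matrix (the totals row and the error columns are left out); the function names are hypothetical helpers, not part of the H2O API.

```python
# Rows = actual class, columns = predicted class, in the same order as `labels`.
labels = ["Australia", "Belgium", "Canada", "Ecuador", "France", "Italy", "U.K.", "U.S.A."]
confusion = [
    [3, 0, 0, 0, 0, 0, 0, 2],
    [0, 2, 0, 0, 0, 1, 0, 0],
    [0, 0, 12, 0, 0, 0, 0, 1],
    [0, 0, 0, 3, 0, 0, 0, 0],
    [0, 0, 0, 0, 8, 2, 0, 1],
    [0, 0, 0, 0, 0, 10, 0, 0],
    [0, 0, 0, 1, 0, 2, 4, 4],
    [0, 0, 0, 0, 0, 0, 0, 71],
]

def per_class_errors(matrix):
    """Fraction of each actual class that was predicted as some other class."""
    return [1 - row[i] / sum(row) for i, row in enumerate(matrix)]

def mean_per_class_error(matrix):
    """Unweighted average of the per-class error rates."""
    errors = per_class_errors(matrix)
    return sum(errors) / len(errors)

def top1_hit_ratio(matrix):
    """Share of all test rows whose top prediction was the actual class."""
    correct = sum(row[i] for i, row in enumerate(matrix))
    total = sum(sum(row) for row in matrix)
    return correct / total

print(round(mean_per_class_error(confusion), 4))  # 0.2149
print(round(top1_hit_ratio(confusion), 4))        # 0.8898
```

The mean per-class error weights every class equally, so the small classes (for example U.K. with 7 of 11 wrong) pull it well above the overall error of 14 / 127, which is dominated by the 71 correctly classified U.S.A. rows.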

4. Now, let’s create a tuned model that gives superior performance. However, it should take no more than 10 times the running time of the baseline model, so again our script should time the model.

model_tuned = H2ODeepLearningEstimator(epochs=200,
                                       distribution="multinomial",
                                       activation="RectifierWithDropout",
                                       stopping_rounds=5,
                                       stopping_tolerance=0,
                                       stopping_metric="logloss",
                                       input_dropout_ratio=0.2,
                                       l1=1e-5,
                                       hidden=[200,200,200])
%time model_tuned.train(x, y, training_frame = train, validation_frame = valid)
# deeplearning Model Build progress: |██████████████████████████████████████| 100%
# Wall time: 30.8 s

model_tuned.model_performance(train).mean_per_class_error() # 0.0
model_tuned.model_performance(valid).mean_per_class_error() # 0.07696485401964853
perf_test = model_tuned.model_performance(test)
print('Mean class error', perf_test.mean_per_class_error()) # Mean class error 0.05909090909090909
print('log loss', perf_test.logloss()) # log loss 0.14153784501504524
print('MSE', perf_test.mse()) # MSE 0.03497231075826773
print('RMSE', perf_test.rmse()) # RMSE 0.18700885208531637
perf_test.hit_ratio_table()
Top-8 Hit Ratios: 
k hit_ratio
1 0.9606299
2 0.984252
3 0.984252
4 0.992126
5 0.992126
6 0.992126
7 1.0
8 1.0
perf_test.confusion_matrix().as_data_frame()

   Australia  Belgium  Canada  Ecuador  France  Italy  U.K.  U.S.A.  Error     Rate
0  5.0        0.0      0.0     0.0      0.0     0.0    0.0   0.0     0.000000  0 / 5
1  0.0        3.0      0.0     0.0      0.0     0.0    0.0   0.0     0.000000  0 / 3
2  0.0        0.0      13.0    0.0      0.0     0.0    0.0   0.0     0.000000  0 / 13
3  0.0        0.0      0.0     3.0      0.0     0.0    0.0   0.0     0.000000  0 / 3
4  0.0        0.0      0.0     0.0      11.0    0.0    0.0   0.0     0.000000  0 / 11
5  0.0        0.0      0.0     0.0      1.0     8.0    0.0   1.0     0.200000  2 / 10
6  0.0        0.0      0.0     0.0      0.0     0.0    8.0   3.0     0.272727  3 / 11
7  0.0        0.0      0.0     0.0      0.0     0.0    0.0   71.0    0.000000  0 / 71
8  5.0        3.0      13.0    3.0      12.0    8.0    8.0   75.0    0.039370  5 / 127

model_tuned.plot()

As can be seen from the above plot, the early-stopping strategy kept the model from overfitting, and the model achieves better accuracy on the test dataset.
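
The idea behind early stopping can be sketched in plain Python. The rule below is a deliberately simplified patience rule, not H2O's exact criterion (H2O bases its decision on a moving average of the stopping metric over scoring events); the function name and the sample loss values are hypothetical.

```python
def early_stopping_round(val_losses, patience=5, tolerance=0.0):
    """Return the index of the scoring round at which training would stop.

    Simplified rule: stop once the validation loss has failed to improve on its
    best value by more than `tolerance` (relative) for `patience` rounds in a row.
    """
    best = float("inf")
    rounds_without_improvement = 0
    for i, loss in enumerate(val_losses):
        if loss < best * (1 - tolerance):
            best = loss
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                return i
    return len(val_losses) - 1  # rule never triggered: training ran to the end

# Hypothetical validation log loss per scoring round: it improves, then degrades.
losses = [0.90, 0.60, 0.40, 0.30, 0.25, 0.26, 0.27, 0.29, 0.31, 0.35, 0.40]
print(early_stopping_round(losses))  # 9: five rounds after the best value at round 4
```

With patience set to 5 and tolerance set to 0, mirroring stopping_rounds=5 and stopping_tolerance=0 in the tuned model, training halts once the validation log loss has gone five rounds without a new best, instead of running all 200 epochs.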

5. Let’s save both the models, to the local disk, using save_model(), to export the binary version of the model. (Do not export a POJO.)

h2o.save_model(model, 'base_model')
h2o.save_model(model_tuned, 'tuned_model')

We may want to include a seed in the model function above to get reproducible results.

Problem 3

Predict Price of a house with Stacked Ensemble model with H2O

The data is available at http://coursera.h2o.ai/house_data.3487.csv. This is a regression problem. We have to predict the “price” of a house given different feature values. We shall use python client for H2O again for this problem.

The data needs to be split into train and test, using 0.9 for the ratio, and a seed of 123. That should give 19,462 training rows and 2,151 test rows. The target is an RMSE below $123,000.

  1. Let’s start H2O, load the chosen dataset and follow the data manipulation steps. For example, we can split date into year and month columns. We can then optionally combine them into a numeric date column. At the end of this step we shall have train, test, x and y variables, and possibly valid also. The code snippet below shows how to do this.

import h2o
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
from time import time

h2o.init()
url = "http://coursera.h2o.ai/house_data.3487.csv"
house_df = h2o.import_file(url, destination_frame = "house_data")
# Parse progress: |█████████████████████████████████████████████████████████| 100%

Preprocessing

# split the date string into numeric year, month and day columns
house_df['year'] = house_df['date'].substring(0,4).asnumeric()
house_df['month'] = house_df['date'].substring(4,6).asnumeric()
house_df['day'] = house_df['date'].substring(6,8).asnumeric()
house_df = house_df.drop('date')
house_df.head()
id price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15 year month day
7.1293e+09 221900 3 1 1180 5650 1 0 0 3 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650 2014 10 13
6.4141e+09 538000 3 2.25 2570 7242 2 0 0 3 7 2170 400 1951 1991 98125 47.721 -122.319 1690 7639 2014 12 9
5.6315e+09 180000 2 1 770 10000 1 0 0 3 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062 2015 2 25
2.4872e+09 604000 4 3 1960 5000 1 0 0 5 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000 2014 12 9
1.9544e+09 510000 3 2 1680 8080 1 0 0 3 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503 2015 2 18
7.23755e+09 1.225e+06 4 4.5 5420 101930 1 0 0 3 11 3890 1530 2001 0 98053 47.6561 -122.005 4760 101930 2014 5 12
1.3214e+09 257500 3 2.25 1715 6819 2 0 0 3 7 1715 0 1995 0 98003 47.3097 -122.327 2238 6819 2014 6 27
2.008e+09 291850 3 1.5 1060 9711 1 0 0 3 7 1060 0 1963 0 98198 47.4095 -122.315 1650 9711 2015 1 15
2.4146e+09 229500 3 1 1780 7470 1 0 0 3 7 1050 730 1960 0 98146 47.5123 -122.337 1780 8113 2015 4 15
3.7935e+09 323000 3 2.5 1890 6560 2 0 0 3 7 1890 0 2003 0 98038 47.3684 -122.031 2390 7570 2015 3 12
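The date handling above relies on fixed character positions. The same idea can be sketched in plain Python, including the optional combined numeric date mentioned in step 1. The raw string format, the helper names split_date and numeric_date, and the year-plus-fraction encoding are assumptions made for illustration only.

```python
def split_date(raw):
    """Split a 'YYYYMMDD...' string into integer (year, month, day)."""
    return int(raw[0:4]), int(raw[4:6]), int(raw[6:8])


def numeric_date(year, month):
    """Combine year and month into one number (January maps to .0)."""
    return year + (month - 1) / 12.0


sample = "20141013T000000"  # assumed raw format, for illustration only
year, month, day = split_date(sample)
print(year, month, day)
print(numeric_date(year, month))
```

A single numeric column of this kind preserves the ordering of sale dates, which separate year and month columns only do jointly.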
house_df.describe() 
id price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15 year month day
type int int int real int int real int int int int int int int int int real real int int int int int
mins 1000102.0 75000.0 0.0 0.0 290.0 520.0 1.0 0.0 0.0 1.0 1.0 290.0 0.0 1900.0 0.0 98001.0 47.1559 -122.519 399.0 651.0 2014.0 1.0 1.0
mean 4580301520.864987 540088.1417665284 3.370841623097218 2.114757321982139 2079.899736269819 15106.96756581695 1.4943089807060526 0.007541757275713691 0.23430342849211097 3.4094295100171164 7.6568731781798105 1788.3906907879518 291.50904548188555 1971.0051357979064 84.4022579003377 98077.93980474674 47.56005251931665 -122.21389640494158 1986.5524915560036 12768.45565169118 2014.3229537778102 6.574422801091883 15.688196918521294
maxs 9900000190.0 7700000.0 33.0 8.0 13540.0 1651359.0 3.5 1.0 4.0 5.0 13.0 9410.0 4820.0 2015.0 2015.0 98199.0 47.7776 -121.315 6210.0 871200.0 2015.0 12.0 31.0
sigma 2876565571.3120522 367127.19648270035 0.930061831147451 0.7701631572177408 918.4408970468095 41420.51151513551 0.5399888951423489 0.08651719772788766 0.7663175692736117 0.6507430463662044 1.1754587569743344 828.0909776519175 442.57504267746685 29.373410802386235 401.67924001917555 53.50502625747248 0.13856371024192368 0.14082834238139297 685.3913042527788 27304.179631338524 0.4676160310451536 3.1153077787263648 8.635062534286034
zeros 0 0 13 10 0 0 0 21450 19489 0 0 0 13126 0 20699 0 0 0 0 0 0 0 0
missing 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 7129300520.0 221900.0 3.0 1.0 1180.0 5650.0 1.0 0.0 0.0 3.0 7.0 1180.0 0.0 1955.0 0.0 98178.0 47.5112 -122.257 1340.0 5650.0 2014.0 10.0 13.0
1 6414100192.0 538000.0 3.0 2.25 2570.0 7242.0 2.0 0.0 0.0 3.0 7.0 2170.0 400.0 1951.0 1991.0 98125.0 47.721000000000004 -122.319 1690.0 7639.0 2014.0 12.0 9.0
2 5631500400.0 180000.0 2.0 1.0 770.0 10000.0 1.0 0.0 0.0 3.0 6.0 770.0 0.0 1933.0 0.0 98028.0 47.7379 -122.233 2720.0 8062.0 2015.0 2.0 25.0
3 2487200875.0 604000.0 4.0 3.0 1960.0 5000.0 1.0 0.0 0.0 5.0 7.0 1050.0 910.0 1965.0 0.0 98136.0 47.5208 -122.393 1360.0 5000.0 2014.0 12.0 9.0
4 1954400510.0 510000.0 3.0 2.0 1680.0 8080.0 1.0 0.0 0.0 3.0 8.0 1680.0 0.0 1987.0 0.0 98074.0 47.616800000000005 -122.045 1800.0 7503.0 2015.0 2.0 18.0
5 7237550310.0 1225000.0 4.0 4.5 5420.0 101930.0 1.0 0.0 0.0 3.0 11.0 3890.0 1530.0 2001.0 0.0 98053.0 47.6561 -122.005 4760.0 101930.0 2014.0 5.0 12.0
6 1321400060.0 257500.0 3.0 2.25 1715.0 6819.0 2.0 0.0 0.0 3.0 7.0 1715.0 0.0 1995.0 0.0 98003.0 47.3097 -122.327 2238.0 6819.0 2014.0 6.0 27.0
7 2008000270.0 291850.0 3.0 1.5 1060.0 9711.0 1.0 0.0 0.0 3.0 7.0 1060.0 0.0 1963.0 0.0 98198.0 47.4095 -122.315 1650.0 9711.0 2015.0 1.0 15.0
8 2414600126.0 229500.0 3.0 1.0 1780.0 7470.0 1.0 0.0 0.0 3.0 7.0 1050.0 730.0 1960.0 0.0 98146.0 47.5123 -122.337 1780.0 8113.0 2015.0 4.0 15.0
9 3793500160.0 323000.0 3.0 2.5 1890.0 6560.0 2.0 0.0 0.0 3.0 7.0 1890.0 0.0 2003.0 0.0 98038.0 47.3684 -122.031 2390.0 7570.0 2015.0 3.0 12.0
plt.hist(house_df.as_data_frame()['price'].tolist(), bins=np.linspace(0,10**6,1000))
plt.show()

We shall use cross-validation and not a validation dataset.

train, test = house_df.split_frame(ratios=[0.9], destination_frames = ['train', 'test'], seed=123)
print("%d/%d" %(train.nrows, test.nrows))
# 19462/2151

# use every column except the row id and the target as a predictor
ignore_fields = ['id', 'price']
x = [i for i in train.names if i not in ignore_fields]
y = 'price'

2. Let’s now train at least four different models on the preprocessed dataset, using at least three different supervised algorithms. Let’s save all the models.

from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

nfolds = 5 # for cross-validation
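All the models in this post are trained with nfolds = 5 and fold_assignment="Modulo". Roughly speaking, a modulo scheme assigns each row to a fold by its row index modulo the number of folds, so the assignment is deterministic. The sketch below is a plain-Python illustration of that idea and is not H2O's internal code; the helper names modulo_folds and cv_splits are hypothetical.

```python
def modulo_folds(n_rows, nfolds):
    """Assign row i to fold i % nfolds (deterministic, no seed needed)."""
    return [i % nfolds for i in range(n_rows)]


def cv_splits(n_rows, nfolds):
    """Yield (training_indices, holdout_indices) once per fold."""
    folds = modulo_folds(n_rows, nfolds)
    for k in range(nfolds):
        holdout = [i for i, f in enumerate(folds) if f == k]
        training = [i for i, f in enumerate(folds) if f != k]
        yield training, holdout


for training, holdout in cv_splits(n_rows=10, nfolds=5):
    print(holdout, "held out;", len(training), "rows used for training")
```

Giving every base model the same fold assignment matters later, because a stacked ensemble combines the cross-validated predictions row by row.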

Let’s first fit a GLM model. The best performing α hyperparameter value (for controlling L1 vs. L2 regularization) for GLM will be found using GridSearch, as shown in the below code snippet.

g = h2o.grid.H2OGridSearch(
    H2OGeneralizedLinearEstimator(family="gaussian",
                                  nfolds=nfolds,
                                  fold_assignment="Modulo",
                                  keep_cross_validation_predictions=True,
                                  lambda_search=True),
    hyper_params={
        "alpha": [x * 0.01 for x in range(0,100)],
    },
    search_criteria={
        "strategy": "RandomDiscrete",
        "max_models": 8,
        "stopping_metric": "rmse",
        "max_runtime_secs": 60
    }
)
g.train(x, y, train)
g
#glm Grid Build progress: |████████████████████████████████████████████████| 100%
#                     alpha  \
#0                   [0.61]
#1                   [0.78]
#2                   [0.65]
#3                   [0.13]
#4    [0.35000000000000003]
#5                   [0.05]
#6                   [0.32]
#7                   [0.55]
#                                              model_ids     residual_deviance
#0  Grid_GLM_train_model_python_1628864392402_41_model_3  2.626981989511134E15
#1  Grid_GLM_train_model_python_1628864392402_41_model_6  2.626981989511134E15
#2  Grid_GLM_train_model_python_1628864392402_41_model_5  2.626981989511134E15
#3  Grid_GLM_train_model_python_1628864392402_41_model_2  2.626981989511134E15
#4  Grid_GLM_train_model_python_1628864392402_41_model_4  2.626981989511134E15
#5  Grid_GLM_train_model_python_1628864392402_41_model_7  2.626981989511134E15
#6  Grid_GLM_train_model_python_1628864392402_41_model_0  2.626981989511134E15
#7  Grid_GLM_train_model_python_1628864392402_41_model_1  2.626981989511134E15
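To make the role of α concrete: in an elastic net, α mixes the L1 (lasso) and L2 (ridge) penalties, and λ scales the overall strength. The sketch below uses the commonly documented form λ(α‖β‖₁ + (1−α)/2 ‖β‖₂²); the function name and the toy coefficients are assumptions for illustration, and this is not H2O's own code.

```python
def elastic_net_penalty(coefficients, alpha, lam):
    """lam * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm)."""
    l1 = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


beta = [0.5, -2.0, 0.0]  # toy coefficient vector
for alpha in (0.0, 0.61, 1.0):
    # alpha = 0 is pure ridge, alpha = 1 is pure lasso
    print(alpha, elastic_net_penalty(beta, alpha, lam=0.1))
```

With lambda_search=True, the strength λ is searched automatically for each α, so the grid only needs to vary α.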

Model 1

model_GLM = H2OGeneralizedLinearEstimator(
    family='gaussian', #'gamma',
    model_id='glm_house',
    nfolds=nfolds,
    alpha=0.61,
    fold_assignment="Modulo",
    keep_cross_validation_predictions=True)
%time model_GLM.train(x, y, train)
#glm Model Build progress: |███████████████████████████████████████████████| 100%
#Wall time: 259 ms
model_GLM.cross_validation_metrics_summary().as_data_frame()
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
0 mae 230053.23 715.8795 229225.16 230969.69 228503.45 230529.47 231038.42
1 mean_residual_deviance 1.31780157E11 4.5671977E9 1.32968604E11 1.41431144E11 1.31364495E11 1.32024402E11 1.21112134E11
2 mse 1.31780157E11 4.5671977E9 1.32968604E11 1.41431144E11 1.31364495E11 1.32024402E11 1.21112134E11
3 null_deviance 5.25455325E14 1.80834544E13 5.3056184E14 5.636807E14 5.23549568E14 5.26203388E14 4.83281095E14
4 r2 0.023522535 4.801036E-4 0.024299357 0.023168933 0.022531934 0.023340257 0.024272196
5 residual_deviance 5.12943247E14 1.7808912E13 5.17646773E14 5.5059142E14 5.11270625E14 5.13838982E14 4.71368433E14
6 rmse 362905.53 6314.0225 364648.6 376073.3 362442.4 363351.62 348011.7
7 rmsle 0.53911585 0.0047404445 0.54277176 0.5389013 0.5275475 0.53846484 0.54789394
model_GLM.model_performance(test)
#ModelMetricsRegressionGLM: glm
#** Reported on test data. **
#MSE: 128806123545.59714
#RMSE: 358895.7000934911
#MAE: 233890.6933813204
#RMSLE: 0.5456714021880726
#R^2: 0.03102347771355851
#Mean Residual Deviance: 128806123545.59714
#Null degrees of freedom: 2150
#Residual degrees of freedom: 2129
#Null deviance: 285935013037402.7
#Residual deviance: 277061971746579.44
#AIC: 61176.23965800522

As can be seen above, GLM could not achieve the target RMSE of below $123k on either the cross-validation or the test dataset.

The models below (GBM, DRF and DL) and the corresponding parameters were found using the AutoML leaderboard and GridSearch, along with some manual tuning.

from h2o.automl import H2OAutoML

model_auto = H2OAutoML(max_runtime_secs=60, seed=123)
model_auto.train(x, y, train)
# AutoML progress: |████████████████████████████████████████████████████████| 100%
# Parse progress: |█████████████████████████████████████████████████████████| 100%
model_auto.leaderboard
model_id mean_residual_deviance rmse mae rmsle
GBM_grid_0_AutoML_20210814_005121_model_0 2.01725e+10 142030 77779.1 0.184269
GBM_grid_0_AutoML_20210814_005121_model_1 2.6037e+10 161360 93068.1 0.218365
DRF_0_AutoML_20210814_005121 3.27251e+10 180901 102782 0.243474
XRT_0_AutoML_20210814_005121 3.53492e+10 188014 104259 0.246899
GBM_grid_0_AutoML_20210813_201225_model_0 5.99803e+10 244909 153548 0.351959
GBM_grid_0_AutoML_20210813_201225_model_2 6.09613e+10 246903 152570 0.349919
GBM_grid_0_AutoML_20210813_201225_model_1 6.09941e+10 246970 153096 0.350852
GBM_grid_0_AutoML_20210813_201225_model_3 6.22174e+10 249434 153105 0.350598
DeepLearning_0_AutoML_20210813_201225 6.39672e+10 252917 163993 0.378761
DRF_0_AutoML_20210813_201225 6.76936e+10 260180 158078 0.360337
model_auto.leader.model_performance(test) # model_auto.leader.explain(test)
#ModelMetricsRegression: gbm
#** Reported on test data. **
#MSE: 17456681023.716145
#RMSE: 132123.73376390839
#MAE: 77000.00253466706
#RMSLE: 0.1899899418603569
#Mean Residual Deviance: 17456681023.716145

model = h2o.get_model(model_auto.leaderboard[4, 'model_id']) # get model by model_id
print(model.params['model_id']['actual']['name'])
print(model.model_performance(test).rmse())
[(k, v) for (k, v) in model.params.items() if v['default'] != v['actual'] and \
    not k in ['model_id', 'training_frame', 'validation_frame', 'nfolds',
              'keep_cross_validation_predictions', 'seed',
              'response_column', 'fold_assignment', 'ignored_columns']]
# GBM_grid_0_AutoML_20210813_201225_model_0
# 235011.60404473927
# [('score_tree_interval', {'default': 0, 'actual': 5}),
#  ('ntrees', {'default': 50, 'actual': 60}),
#  ('max_depth', {'default': 5, 'actual': 6}),
#  ('min_rows', {'default': 10.0, 'actual': 1.0}),
#  ('stopping_tolerance', {'default': 0.001, 'actual': 0.008577452408351779}),
#  ('seed', {'default': -1, 'actual': 123}),
#  ('distribution', {'default': 'AUTO', 'actual': 'gaussian'}),
#  ('sample_rate', {'default': 1.0, 'actual': 0.8}),
#  ('col_sample_rate', {'default': 1.0, 'actual': 0.8}),
#  ('col_sample_rate_per_tree', {'default': 1.0, 'actual': 0.8})]

Model 2

model_GBM = H2OGradientBoostingEstimator(
    model_id='gbm_house',
    nfolds=nfolds,
    ntrees=500,
    fold_assignment="Modulo",
    keep_cross_validation_predictions=True,
    seed=123)
%time model_GBM.train(x, y, train)
#gbm Model Build progress: |███████████████████████████████████████████████| 100%
#Wall time: 54.9 s
model_GBM.cross_validation_metrics_summary().as_data_frame()
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
0 mae 64136.496 912.2387 62751.688 66573.63 63946.31 63873.707 63537.137
1 mean_residual_deviance 1.38268457E10 1.43582912E9 1.24595825E10 1.75283814E10 1.2894718E10 1.43893801E10 1.18621655E10
2 mse 1.38268457E10 1.43582912E9 1.24595825E10 1.75283814E10 1.2894718E10 1.43893801E10 1.18621655E10
3 r2 0.8979097 0.0075696795 0.90857375 0.87893564 0.9040519 0.89355356 0.90443367
4 residual_deviance 1.38268457E10 1.43582912E9 1.24595825E10 1.75283814E10 1.2894718E10 1.43893801E10 1.18621655E10
5 rmse 117288.305 5928.7188 111622.5 132394.8 113554.914 119955.74 108913.57
6 rmsle 0.16441989 0.0025737707 0.16231671 0.17041409 0.15941188 0.16528262 0.16467415

As can be seen from the above table (row 5, column 1), the mean RMSE for cross-validation is 117288.305, which is below $123k.
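As a quick sanity check, the mean column can be reproduced from the five per-fold values in the rmse row. This is a minimal sketch using only the numbers printed in the table above; the variable names are illustrative.

```python
from statistics import mean

# Per-fold RMSE values copied from the "rmse" row of the GBM table above.
fold_rmse = [111622.5, 132394.8, 113554.914, 119955.74, 108913.57]
target = 123000

mean_rmse = mean(fold_rmse)
print(round(mean_rmse, 3))                     # mean RMSE across the folds
print(mean_rmse < target)                      # is the mean below the target?
print([r for r in fold_rmse if r >= target])   # folds that miss the target
```

Note that the target is met on average: one fold (cv_2_valid, at 132394.8) is above $123k even though the mean is below it.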

model_GBM.model_performance(test)
#ModelMetricsRegression: gbm
#** Reported on test data. **
#MSE: 14243079402.729088
#RMSE: 119344.37315068142
#MAE: 65050.344749203745
#RMSLE: 0.16421689257411975
#Mean Residual Deviance: 14243079402.729088

As can be seen above, GBM achieved the target RMSE of below $123k on the test dataset.

Now, let’s try a random forest model, finding the best parameters with Grid Search:

g = h2o.grid.H2OGridSearch(
    H2ORandomForestEstimator(
        nfolds=nfolds,
        fold_assignment="Modulo",
        keep_cross_validation_predictions=True,
        seed=123),
    hyper_params={
        "ntrees": [20, 25, 30],
        "stopping_tolerance": [0.005, 0.006, 0.0075],
        "max_depth": [20, 50, 100],
        "min_rows": [5, 7, 10]
    },
    search_criteria={
        "strategy": "RandomDiscrete",
        "max_models": 10,
        "stopping_metric": "rmse",
        "max_runtime_secs": 60
    }
)
g.train(x, y, train)
#drf Grid Build progress: |████████████████████████████████████████████████| 100%
g
#   max_depth min_rows ntrees stopping_tolerance  \
#0        100      5.0     20              0.006
#1        100      5.0     20              0.005
#2        100      5.0     20              0.005
#3        100      7.0     30              0.006
#4         50     10.0     25              0.006
#5         50     10.0     20              0.005
#                                              model_ids      residual_deviance
#0  Grid_DRF_train_model_python_1628864392402_40_model_0  2.0205038467456142E10
#1  Grid_DRF_train_model_python_1628864392402_40_model_5  2.0205038467456142E10
#2  Grid_DRF_train_model_python_1628864392402_40_model_1  2.0205038467456142E10
#3  Grid_DRF_train_model_python_1628864392402_40_model_3   2.099520493338354E10
#4  Grid_DRF_train_model_python_1628864392402_40_model_2   2.260686283035833E10
#5  Grid_DRF_train_model_python_1628864392402_40_model_4   2.279037520277947E10
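The RandomDiscrete strategy samples a limited number of combinations from the full grid instead of trying all of them. The sketch below illustrates that idea in plain Python with the same hyperparameter values; the function random_discrete is hypothetical and is not how H2O implements the search.

```python
import itertools
import random

hyper_params = {
    "ntrees": [20, 25, 30],
    "stopping_tolerance": [0.005, 0.006, 0.0075],
    "max_depth": [20, 50, 100],
    "min_rows": [5, 7, 10],
}


def random_discrete(hyper_params, max_models, seed):
    """Pick up to max_models distinct combinations from the Cartesian grid."""
    names = sorted(hyper_params)
    all_combos = list(itertools.product(*(hyper_params[n] for n in names)))
    rng = random.Random(seed)
    chosen = rng.sample(all_combos, min(max_models, len(all_combos)))
    return [dict(zip(names, combo)) for combo in chosen]


candidates = random_discrete(hyper_params, max_models=10, seed=123)
print(len(candidates), "of", 3 ** 4, "possible combinations")
for params in candidates[:3]:
    print(params)
```

The grid above lists only six models although max_models is 10, presumably because the 60-second max_runtime_secs budget was reached first.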

Model 3

model_RF = H2ORandomForestEstimator(
    model_id='rf_house',
    nfolds=nfolds,
    ntrees=20,
    fold_assignment="Modulo",
    keep_cross_validation_predictions=True,
    seed=123)
%time model_RF.train(x, y, train)
#drf Model Build progress: |███████████████████████████████████████████████| 100%
#Wall time: 13.2 s
model_RF.cross_validation_metrics_summary().as_data_frame()
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
0 mae 72734.0 1162.9153 73242.26 75062.21 73461.65 71646.195 70257.7
1 mean_residual_deviance 1.8545494E10 2.2018921E9 1.79095654E10 2.45911347E10 1.74433321E10 1.71117425E10 1.56716954E10
2 mse 1.8545494E10 2.2018921E9 1.79095654E10 2.45911347E10 1.74433321E10 1.71117425E10 1.56716954E10
3 r2 0.8632202 0.011770816 0.8685827 0.8301549 0.8702062 0.8734147 0.8737426
4 residual_deviance 1.8545494E10 2.2018921E9 1.79095654E10 2.45911347E10 1.74433321E10 1.71117425E10 1.56716954E10
5 rmse 135742.78 7726.2373 133826.62 156815.61 132073.2 130811.86 125186.64
6 rmsle 0.18275535 0.0020155373 0.18441868 0.18689767 0.17945778 0.1833288 0.17967385
model_RF.model_performance(test)
#ModelMetricsRegression: drf
#** Reported on test data. **
#MSE: 16405336914.530426
#RMSE: 128083.3202041953
#MAE: 71572.37981480274
#RMSLE: 0.17712324625977907
#Mean Residual Deviance: 16405336914.530426

As can be seen above, DRF just missed the target RMSE of below $123k on both the cross-validation and the test dataset.

Now, let’s try to fit a deep learning model, again tuning the parameters with Grid Search.

g = h2o.grid.H2OGridSearch(
    H2ODeepLearningEstimator(
        nfolds=nfolds,
        fold_assignment="Modulo",
        keep_cross_validation_predictions=True,
        reproducible=True,
        seed=123),
    hyper_params={
        "epochs": [20, 25],
        "hidden": [[20, 20, 20], [25, 25, 25]],
        "stopping_rounds": [0, 5],
        "stopping_tolerance": [0.006]
    },
    search_criteria={
        "strategy": "RandomDiscrete",
        "max_models": 10,
        "stopping_metric": "rmse",
        "max_runtime_secs": 60
    }
)
g.train(x, y, train)
g
#deeplearning Grid Build progress: |███████████████████████████████████████| 100%
#               epochs        hidden stopping_rounds stopping_tolerance  \
#0   16.79120554889533  [25, 25, 25]               0              0.006
#1  3.1976799968879086  [25, 25, 25]               0              0.006
#                                                       model_ids  \
#0  Grid_DeepLearning_train_model_python_1628864392402_55_model_0
#1  Grid_DeepLearning_train_model_python_1628864392402_55_model_1
#       residual_deviance
#0  1.6484562934855278E10
#1  2.1652538389322113E10

Model 4

model_DL = H2ODeepLearningEstimator(
    epochs=30,
    model_id='dl_house',
    nfolds=nfolds,
    stopping_rounds=7,
    stopping_tolerance=0.006,
    hidden=[30, 30, 30],
    reproducible=True,
    fold_assignment="Modulo",
    keep_cross_validation_predictions=True,
    seed=123
)
%time model_DL.train(x, y, train)
#deeplearning Model Build progress: |██████████████████████████████████████| 100%
#Wall time: 55.7 s
model_DL.cross_validation_metrics_summary().as_data_frame()
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
0 mae 72458.19 1241.8936 71992.18 73569.984 75272.75 70553.38 70902.65
1 mean_residual_deviance 1.48438886E10 5.5005555E8 1.42477005E10 1.59033723E10 1.54513889E10 1.48586271E10 1.37583514E10
2 mse 1.48438886E10 5.5005555E8 1.42477005E10 1.59033723E10 1.54513889E10 1.48586271E10 1.37583514E10
3 r2 0.8899759 0.0023493338 0.89545286 0.8901592 0.885028 0.89008224 0.88915724
4 residual_deviance 1.48438886E10 5.5005555E8 1.42477005E10 1.59033723E10 1.54513889E10 1.48586271E10 1.37583514E10
5 rmse 121793.58 2259.6975 119363.734 126108.58 124303.62 121895.97 117296.0
6 rmsle 0.18431115 0.0011469581 0.18251595 0.18650953 0.18453318 0.18555655 0.18244053

As can be seen from the above table (row 5, column 1), the mean RMSE for cross-validation is 121793.58, which is below $123k.

model_DL.model_performance(test)
#ModelMetricsRegression: deeplearning
#** Reported on test data. **
#MSE: 14781990070.095192
#RMSE: 121581.20771770278
#MAE: 72522.60487846025
#RMSLE: 0.1834924698171073
#Mean Residual Deviance: 14781990070.095192

As can be seen above, the deep learning model achieved the target RMSE of below $123k on the test dataset.

3. Finally, let’s train a stacked ensemble of the models created in the earlier steps. We may need to repeat steps two and three until the best model (which is usually the ensemble model, but does not have to be) reaches the minimum required performance on the cross-validation dataset. Note: only one model has to achieve the minimum required performance. If multiple models achieve it, we need to choose the best-performing one.

models = [model_GBM.model_id, model_RF.model_id, model_DL.model_id] #model_GLM.model_id,
model_SE = H2OStackedEnsembleEstimator(model_id = 'se_gbm_dl_house', base_models=models)
%time model_SE.train(x, y, train)
#stackedensemble Model Build progress: |███████████████████████████████████| 100%
#Wall time: 2.67 s
#model_SE.model_performance(test)
#ModelMetricsRegressionGLM: stackedensemble
#** Reported on test data. **
#MSE: 130916347835.45828
#RMSE: 361823.6418967924
#MAE: 236448.3672215734
#RMSLE: 0.5514878971097109
#R^2: 0.015148783736682492
#Mean Residual Deviance: 130916347835.45828
#Null degrees of freedom: 2150
#Residual degrees of freedom: 2147
#Null deviance: 285935013037402.7
#Residual deviance: 281601064194070.75
#AIC: 61175.193832813566

As can be seen above, the stacked ensemble model could not reach the required performance on either the cross-validation or the test dataset.
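To illustrate what stacking tries to do, here is a deliberately simplified sketch: a "metalearner" that learns a single blend weight for two base models from their held-out predictions. In real stacking the metalearner is trained on the cross-validated predictions of the base models, which is why they were trained with keep_cross_validation_predictions=True. The numbers and the helper names below are made up for illustration, and the one-weight blend is a stand-in for a proper metalearner, not H2O's algorithm.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def fit_blend_weight(y, pred_a, pred_b):
    """Least-squares w for w * pred_a + (1 - w) * pred_b, clipped to [0, 1]."""
    num = sum((yi - b) * (a - b) for yi, a, b in zip(y, pred_a, pred_b))
    den = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    if den == 0:
        return 0.5  # identical predictions: any weight gives the same blend
    return min(1.0, max(0.0, num / den))


def blend(w, pred_a, pred_b):
    """Weighted average of the two base-model predictions."""
    return [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]


# Made-up held-out predictions from two hypothetical base models.
y = [100.0, 200.0, 300.0, 400.0]
pred_a = [110.0, 190.0, 310.0, 390.0]
pred_b = [90.0, 220.0, 280.0, 420.0]

w = fit_blend_weight(y, pred_a, pred_b)
print("weight on model A:", round(w, 4))
print("RMSE A:", rmse(y, pred_a))
print("RMSE B:", rmse(y, pred_b))
print("RMSE blend:", rmse(y, blend(w, pred_a, pred_b)))
```

On the data it was fitted to, such a blend can never be worse than either base model, so an ensemble that scores far worse than its base models, as above, usually deserves a closer look at how it was trained.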

4. Now let’s get the performance on the test data of the chosen model/ensemble, and confirm that this also reaches the minimum target on the test data.

Best Model

The model that performs best in terms of mean cross-validation RMSE and RMSE on the test dataset (both below the minimum target of $123k) is the gradient boosting model (GBM), which is Model 2 above.

model_GBM.model_performance(test)
#ModelMetricsRegression: gbm
#** Reported on test data. **
#MSE: 14243079402.729088
#RMSE: 119344.37315068142
#MAE: 65050.344749203745
#RMSLE: 0.16421689257411975
#Mean Residual Deviance: 14243079402.729088

# save the models
h2o.save_model(model_GBM, 'best_model (GBM)') # the final best model
h2o.save_model(model_SE, 'SE_model')
h2o.save_model(model_GBM, 'GBM_model')
h2o.save_model(model_RF, 'RF_model')
h2o.save_model(model_GLM, 'GLM_model')
h2o.save_model(model_DL, 'DL_model')
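The selection rule from steps 3 and 4 can be written down explicitly: keep the models whose cross-validation and test RMSE are both below the target, then pick the one with the lowest cross-validation RMSE. The sketch below uses the rounded RMSE values reported above; the stacked ensemble is left out because its cross-validation RMSE is not shown. The variable names are illustrative.

```python
TARGET_RMSE = 123000

# (cross-validation RMSE, test RMSE), rounded from the results reported above.
results = {
    "GLM": (362905.53, 358895.70),
    "GBM": (117288.305, 119344.37),
    "DRF": (135742.78, 128083.32),
    "DL": (121793.58, 121581.21),
}

# A model qualifies only if both of its scores are below the target.
qualifying = {name: scores for name, scores in results.items()
              if max(scores) < TARGET_RMSE}

# Among the qualifying models, prefer the lowest cross-validation RMSE.
best = min(qualifying, key=lambda name: qualifying[name][0])
print(sorted(qualifying))
print(best)
```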

Source Prolead brokers usa

Top Reasons to Use Python Language for Web Application Development

The well-regarded TIOBE index ranks Python among the most popular programming languages, and it is widely used for web and web app development. It is a powerful, flexible language for web design and development. Python development services are gaining ground among entrepreneurs globally for these reasons. Let’s discuss them in this post.

Python has an edge over many other programming languages when it comes to building highly functional enterprise websites and web applications. With its steady stream of enhancements, Python app development can meet complex and diverse business challenges. Python app developers can take advantage of the language’s versatility to build efficient web app solutions.

Python’s importance is evident from the fact that software giants like Google, Facebook, and Microsoft rely on it. Let’s understand why Python is a preferred programming language for web application development. But before digging into these reasons, let’s have a brief introduction to Python.

What is Python Language?

It is a highly adaptable and efficient programming language with dynamic typing capabilities. It is useful for developing robust web and web application solutions. As a versatile programming language, Python enables developers to create all sorts of applications including scientific applications, graphics-based system applications, games, command-line utilities, etc. Python consultants can shed light on its usage. 
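As a small illustration of dynamic typing: a Python name is not tied to one type, and the type is carried by the object and checked at run time. The helper describe below is a hypothetical name used only for this sketch.

```python
def describe(value):
    """Return the runtime type name of whatever object is passed in."""
    return type(value).__name__


item = 42              # the name refers to an int
print(describe(item))
item = "forty-two"     # the same name can later refer to a str
print(describe(item))
item = [4, 2]          # or to a list, with no declarations needed
print(describe(item))
```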

As an open-source programming language, Python offers unrestricted copying, embedding, and distribution of the code. What’s more, Python developers can get all the coding information online with ease. As a result, a Python development company can come up with flexible and feature-rich web solutions. Python can give enterprises an edge over peers by offering seamless and future-ready solutions. 

Python app development is steadily gaining popularity among entrepreneurs who want to integrate emerging technologies such as AI, ML, and IoT. Python-based websites make it possible to automate certain processes. Companies can hire Python developers to achieve this objective and succeed in these challenging times. Let us go through how different web app development domains use this language.

Python Use Cases across Various Web Development Domains

AI (Artificial Intelligence) and ML (Machine Learning)

Python is one of the most preferred programming languages for integrating AI and ML into customized web solutions. It is useful for preparing data for ML and for helping AI systems analyze large volumes of data. Python-based websites and web applications can handle high web traffic and fetch user data.

Internet of Things (IoT)

Python web applications can connect cameras and other built-in tools of laptops or smartphones to the Internet as and when necessary. Python-powered business websites are capable of managing an existing IoT network when it comes to fetching and sharing valuable data.

Deep Learning

Web applications based on Python support robotics and image recognition. Deep Learning is useful for processing data in a way similar to that of our brain. Python app development services can help entrepreneurs bring innovative and intelligent web applications to market.

Today, hundreds of thousands of developers use Python for web and web app development. A recent Stack Overflow survey has shown Python to be one of the most in-demand programming languages. Python is a preferred language among developers, and many web developers want to learn it.

Let’s dig deep into the reasons why Python is a preferred language for web and web app development projects. 

Top Reasons Why Developers Select Python for Web Development Projects

Python is not a new language. It has been around since the 1990s, but it has evolved in line with market trends and changing expectations.

Secure Language

When you hire Python developers, you can rest assured of the security and scalability of the web application. The thriving fintech sector prefers Python for its high security and its capability of handling large amounts of data. Senior and experienced Python developers can come up with a functional fintech app with military-level security. Also, developers can find solutions to common issues of Python web development thanks to a thriving community.

Large and Robust Library

It is no exaggeration to say that there is a Python library for everything. Whether entrepreneurs need an elegant website with seamless functionality or a secure and feature-rich web app, Python’s libraries enable developers to build robust web solutions. The world’s most popular Machine Learning (ML) library helps Python web developers integrate machine learning capabilities into a customized web app. The SQLAlchemy library gives an app or website the power of SQL, and with its help Python can support enterprise web development patterns as well as a simple database.

Django Framework

This is one of the biggest reasons for choosing Python for developing complex web applications. Django is Python’s main web development framework, with a highly useful collection of libraries. As a flexible and comprehensive platform for developing any type of web app, Django can build powerful apps for modern enterprises. You can hire Python Django developers to build user-friendly web apps for your business. Django takes the pain out of the development process, so developers can focus on demanding tasks instead of basic issues.

Python web development also offers Flask, the polar opposite of Django. Flask is a microframework and has far fewer ready-made parts than Django. However, this platform is not as flexible as Django. As for the differences, Django can save the developer’s time, whereas Flask requires more time to adapt to changing requirements.

AI and ML Advantages

AI and machine learning technologies are the need of the hour. With Python, you can integrate the functionality of these emerging technologies. This is one of the major reasons for Python’s increasing popularity. It results in a large number of developers who have professional experience in integrating AI-based features into enterprise apps. You can also find many Python developers with ease. In other words, it is much easier to hire Python developers than to hire C++ or other web developers. 

Final Thoughts

When it comes to performance, Python is great. Availability of developers and rich libraries are other big reasons why you should prefer Python for your upcoming web project. A wider talent pool is available for the Python language as compared to other programming languages. You can soon initiate the MVP (Minimum Viable Product) or a big web project using Python. 

All you need to do is consult a reputed Python development company or meet experienced Python consultants to build a team quickly and start the development process as soon as possible for your enterprise.

Digital Transformation Through IoT

Times change minute by minute and day by day. Today, having a great product or service is no longer enough if you want to satisfy your customers’ requirements or hold your place in the market.

One way to set your business apart from the competition is to support innovation and adopt new technologies. That is why organizations bet on digital transformation to stay relevant and keep up with market needs.

As IoT technologies emerge within digital transformation, various factors enhance the utility of IoT and drive its growth. Data is valuable, and AI makes it actionable by enabling digital IoT apps to provide predictive and prospective analytics.

In this article, you will learn how IoT affects digital transformation.

What Does Digital Transformation Mean For Business?

According to the State of Digital Transformation research, market pressure is the main driver of digital transformation, as even well-known market leaders struggle to compete with tech-empowered, agile businesses and startups.

Digital transformation is the best way to future-proof your business and survive tech disruption.

To keep up with customer expectations, more companies need to change their current business processes (or create entirely new ones) with the help of technology, that is, to embark on the path of digital transformation.

From the customer experience that is offered to how you handle your internal processes, digital transformation significantly affects all parts of your business, both internal and external.

Advantages Of Digital Transformation

 

Improves customer experience

Providing digital and advanced tools to customers helps make their lives simpler and easier. It makes the business more appealing to potential customers. Organizations that offer obsolete tools and technologies will struggle to compete with those that use new and updated technologies.

 

Empowers data-driven decision-making 

Digital transformation enables organizations to practice data-driven management by using a range of tools for tracking metrics and analyzing data. This, in turn, helps deliver better outcomes and improve supply chain performance.

 

Improved efficiency 

Innovative software tools for process automation lead to improved efficiency, which in turn brings cost savings and reduces friction in the business.

 

Greater security 

By switching to modern software frameworks, organizations can better secure their data. Today, customers are well aware of data security issues, so this is the best way to win their trust and loyalty.

 

How Does The Internet Of Things Affect Digital Transformation?

There are numerous startups whose entire business model is built around an IoT product line. However, traditional organizations across many sectors can likewise profit by introducing emerging IoT technology solutions to fuel their established business processes.

There are various ways you can transform your business using IoT. Here are some of the ways in which IoT is driving digital transformation and increasing the demand for IoT app development:

  • Starting new business opportunities

By using information generated by IoT devices, organizations can better understand their customers’ requirements, adjust their product offerings accordingly, and introduce new products or services to cater to a wider audience.

 

  • Delivering meaningful, tailored customer experience

By capitalizing on new sources of consumer information, such as IoT devices, organizations can gain deep insight into customer behavior and tailor the customer experience accordingly, through cutting-edge personalization and increased availability.

 

  • Boosting business efficiency

By combining rich data insights with autonomous sensors, Internet of Things mobile applications have the potential to raise business productivity through process automation. Many notable processes can be streamlined, including stock management, logistics management, security, energy maintenance, and so on.

 

  • Reducing operating costs

Process automation will inevitably lead to cost savings and will allow you to use resources wisely. For instance, IoT energy solutions can help you manage utility consumption and waste disposal. This methodology can be applied to heating, ventilation and air conditioning systems, lighting, water supply, and so on.

 

  • Improving employee productivity

Much like cloud and mobile technology advancements, IoT can help you engage your staff, offering greater agility and making your business systems accessible anytime and anywhere. Smart sensors can keep employees connected at all times and deliver real-time insights for better productivity.

Bottom Line

The emerging IoT technology has led businesses to work in smart ways by connecting devices and placing real-time information to customers and employees, to provide a personalized and satisfying experience. With IoT transformation, there is secure integration into business processes and workflow. 

Organizations should get their technology stack in place to prepare for the impact that new technologies such as IoT and digital transformation will bring.

Understanding Probabilistic Programming

Even for many data scientists, probabilistic programming is relatively unfamiliar territory. Yet it is an area that is quickly gaining importance.

In this post, I briefly explain the problem that probabilistic programming addresses.

We can think of Probabilistic Programming as a tool for statistical modelling.

Probabilistic Programming has randomization at its core and the goal of Probabilistic Programming is to provide a statistical analysis that explains a phenomenon.

Probabilistic Programming is based on the idea of latent random variables, which allow us to model uncertainty in a phenomenon. In turn, statistical inference in this case involves determining the values of these latent variables.

A probabilistic programming language (PPL) is built on a few kinds of primitives: primitives for drawing random numbers, primitives for computing probabilities and expectations by conditioning, and primitives for probabilistic inference.
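To make the three kinds of primitives concrete, here is a minimal sketch in plain Python. It is not how any particular PPL is implemented: the names `flip`, `model`, and `infer` are hypothetical, and rejection sampling is used only because it is the simplest inference method to read.

```python
import random


def flip(rng, p=0.5):
    """Sampling primitive: draw one Bernoulli random value."""
    return rng.random() < p


def model(rng):
    """Generative program: two fair coin flips, observed to contain at least one head."""
    a = flip(rng)
    b = flip(rng)
    if not (a or b):
        # Conditioning primitive: discard runs that contradict the observation.
        return None
    # Query: were both flips heads?
    return a and b


def infer(model, n_runs=100_000, seed=0):
    """Inference primitive: estimate P(query | observation) by rejection sampling."""
    rng = random.Random(seed)
    accepted = [r for r in (model(rng) for _ in range(n_runs)) if r is not None]
    return sum(accepted) / len(accepted)


estimate = infer(model)
print(estimate)
```

Analytically, the probability of two heads given at least one head is 1/3, so the printed estimate should land close to that value. Real PPLs replace rejection sampling with far more efficient inference algorithms, but the division of labour between sampling, conditioning, and inference is the same.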

A PPL works a bit differently from traditional machine learning languages. The prior distributions are encoded as assumptions in the model. In the inference stage, the posterior distributions of the parameters of the model are computed based on observed data, i.e., inference updates the prior probabilities in light of the observed data.
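The prior-to-posterior step can be illustrated with the simplest case that has a closed-form answer: a Beta prior on a coin's bias, updated with observed coin flips. The sketch below is plain Python rather than a PPL, and the prior parameters and the observed counts are made-up values chosen only for illustration.

```python
def update_beta(alpha, beta, heads, tails):
    """Conjugate update: Beta(alpha, beta) prior plus coin-flip data gives a Beta posterior."""
    return alpha + heads, beta + tails


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Assumed prior: Beta(2, 2), a mild belief that the coin is roughly fair.
prior = (2.0, 2.0)

# Hypothetical observations: 7 heads and 3 tails.
posterior = update_beta(*prior, heads=7, tails=3)

print(beta_mean(*prior))      # 0.5
print(beta_mean(*posterior))  # 9/14, roughly 0.643
```

The posterior mean sits between the prior mean (0.5) and the observed frequency (0.7), which is the "adjusting the prior based on observed data" described above. A PPL automates this kind of update for models that have no closed-form solution.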

All this sounds a bit abstract. But how do you use it?

One way is to build Bayesian probabilistic graphical models with packages like PyMC3.

Another way is to combine deep learning with PPLs in deep PPLs, implemented through packages like TensorFlow Probability.

For more about Probabilistic deep learning, see

Probabilistic Deep Learning with Probabilistic Neural Networks and …

Finally, it’s important to emphasise that probabilistic programming takes a different approach from traditional model building.

In traditional CS/machine learning models, the model is defined by parameters which generate the output. In statistical/Bayesian programming, the parameters are not fixed or predetermined. Instead, we start with a generative process, and the parameters are determined as part of the inference based on the inputs.

In subsequent posts, we will expand on these ideas in detail.

Image source: TensorFlow Probability


How to Use AI for Intelligent Inventory Management

Artificial Intelligence (AI) is in high demand in practically every industry. One of the best examples of successful use of the technology is in retail and e-commerce, especially in inventory management systems. AI gives organizations powerful insights, such as trends identified from large volumes of data, so that business owners and their warehouse teams can better manage the daily tasks of inventory management.

Improved decision-making, reduced costs, lower risk, optimized warehouse work, and increased productivity are just a few benefits of implementing AI technology. According to the statistics, by 2020 about 45.1% of companies had already invested in warehouse automation and 40.1% in AI solutions.

5 Ways To Use Artificial Intelligence For Inventory Management

It’s estimated that AI can add $1.3 trillion to the global economy in the next twenty years if the technology is used in supply chain and logistics management. That is because AI can make supply chain management more efficient at all stages.

Nvidia, IBM, Amazon, Facebook, Microsoft, Salesforce, Alteryx, Twilio, Tencent, and Alphabet are a few of the big names among the companies that have already leveraged the benefits of AI. The following are five ways that AI is revolutionizing inventory management.

1. Data Mining and Turning It Into Solutions

AI is extremely helpful in data mining. AI solutions can not only gather data but also analyze it and turn it into timely actions. Built into an inventory management system, AI thus helps a business evolve more rapidly and find more effective solutions to a particular situation. By monitoring, gathering, recording, and processing data about every customer’s interests, businesses can understand customer demand, build more effective strategies, anticipate customers’ needs, and stock products accordingly.

2. Dealing with Forecasting, Planning, and Control Issues In The Inventory Management Process

Inventory management is not only about storing and delivering items; it is also about forecasting, planning, and control. By implementing AI solutions, you reduce the risks of overstocking and understocking thanks to the technology’s ability to:

  • Accurately analyze and correlate demand insights;
  • Detect and respond to the change in demand for a specific product;
  • Consider location-specific demand.

AI-based solutions have the flexibility to analyze the many factors and situations that are vital for successful planning, stocking, and delivery scheduling. By reducing errors and issues in inventory management, a business can increase customer satisfaction and save costs.
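As a rough illustration of how a demand forecast feeds a stocking decision, the sketch below uses simple exponential smoothing and a basic reorder-point formula. This is a classical statistical stand-in, not the AI solutions the article refers to, and the demand history, lead time, and safety stock are made-up numbers.

```python
def exponential_smoothing(demand, alpha=0.5):
    """Return a one-step-ahead forecast by exponentially smoothing the demand history."""
    forecast = demand[0]
    for actual in demand[1:]:
        # Blend the newest observation with the running forecast.
        forecast = alpha * actual + (1 - alpha) * forecast
    return forecast


def reorder_point(daily_forecast, lead_time_days, safety_stock):
    """Stock level at which a new order should be placed."""
    return daily_forecast * lead_time_days + safety_stock


# Hypothetical daily unit sales for one product at one location.
history = [100, 120, 110, 130]

forecast = exponential_smoothing(history, alpha=0.5)
threshold = reorder_point(forecast, lead_time_days=3, safety_stock=50)

print(forecast)   # 120.0
print(threshold)  # 410.0
```

Running the same calculation per product and per location is one simple way to respond to changing and location-specific demand; production systems would typically use richer models and many more input signals.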

3. Stock Management and Delivery

Planning errors and/or inadequate stock monitoring can result in shortages, delays, and other issues that affect revenue. AI technology can help here. It can collect data about customers and analyze it to identify behavior patterns and other crucial factors that help:

  • Plan the stocking right;
  • Automate the stocking and fulfillment processes;
  • Leverage and react to incoming customer demands on time;
  • Establish efficient transportation, and more.

AI can also streamline deliveries and increase their efficiency. On-time deliveries and transportation are fundamentals of supply chain management that have a huge impact on consumer satisfaction. AI can analyze and make sense of a company’s telematics data and help find optimal routes to ensure the timely arrival of orders. It can also identify patterns and draw conclusions about the company’s delivery processes so that you can improve them.

4. AI-Powered Robots to Optimize Warehouse Operations

AI-powered robots are not a new thing. Giants such as Amazon already use them for day-to-day tasks. It’s forecasted that the robot automation market’s value will reach $10+ billion by 2023. A number of benefits give AI-based robots an edge over human staff:

  • They can work 24/7 tirelessly;
  • They take less time per action;
  • They can locate goods and scan their condition, collecting the data needed for further analysis;
  • They provide real-time tracking of products;
  • Robots can select and move orders, reducing manual errors;
  • They perform inventory optimization, and so on.

All of that can save a business a big chunk of its operational budget. In addition, AI-powered robots used in warehouses free up employees, who can then be assigned to more urgent and vital tasks that require human cognition.

5. Logistics Route Optimization 

One of the most critical components in logistics is route optimization. By implementing AI solutions, companies can reduce time lost in traffic, provide faster delivery times, and thereby save costs. That’s because AI can help in:

  • Lowering shipping costs by learning all the possible variants and finding the fastest and most cost-effective ways to deliver the orders to the customers.
  • Planning optimal routes. AI can learn traffic patterns over time, analyze the received data, and consider different factors while routing. All that enables drivers to avoid traffic jams more effectively.
  • Calculating more precise delivery time. Using complex algorithms, AI technology can calculate the delivery time more accurately by taking into account historical and real-time data, optimal routes, and other factors that can affect delivery efficiency.
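To give a feel for what route optimization involves, here is a minimal sketch of the nearest-neighbor heuristic: always drive to the closest stop that has not been visited yet. It is a simple classical heuristic, not the learning-based routing described above, and it ignores traffic, time windows, and vehicle capacity. The coordinates are made up.

```python
import math


def nearest_neighbor_route(depot, stops):
    """Greedy route: from the current position, always visit the closest remaining stop."""
    route = [depot]
    remaining = list(stops)
    while remaining:
        current = route[-1]
        closest = min(remaining, key=lambda stop: math.dist(current, stop))
        remaining.remove(closest)
        route.append(closest)
    return route


def route_length(route):
    """Total straight-line distance travelled along the route."""
    return sum(math.dist(a, b) for a, b in zip(route, route[1:]))


# Hypothetical depot and delivery stops on a flat map.
depot = (0, 0)
stops = [(5, 0), (1, 0), (3, 0)]

route = nearest_neighbor_route(depot, stops)
print(route)                          # [(0, 0), (1, 0), (3, 0), (5, 0)]
print(route_length(route))            # 5.0
print(route_length([depot] + stops))  # 11.0 when stops are visited in the given order
```

The greedy route is not guaranteed to be optimal in general, but even this simple reordering cuts the distance in the example from 11 to 5 units. An AI-based system would replace straight-line distance with travel times learned from historical and real-time data.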

Final Thoughts

AI has revolutionized and reshaped both inventory management and the way companies stock and store products. AI solutions enable businesses to make inventory management pre-planned, automated, driven by customer demand, and even carried out by robots. AI empowers companies to:

  • Enhance user experience and consumer satisfaction;
  • Increase sales;
  • Reduce costs;
  • Boost the overall productivity of the company.

AI is the future of the industry. Thus, if you want to stay competitive, you should implement the technology as soon as possible. The results can be outstanding.

Synthetic Image Generation using GANs

Occasionally a novel neural network architecture comes along that enables a truly unique way of solving specific deep learning problems. This has certainly been the case with Generative Adversarial Networks (GANs), originally proposed by Ian Goodfellow et al. in a 2014 paper that has been cited more than 32,000 times since its publication. Among other applications, GANs have become the preferred method for synthetic image generation. The results of using GANs for creating realistic images of people who do not exist have raised many ethical issues along the way. 

In this blog post we focus on using GANs to generate synthetic images of skin lesions for medical image analysis in dermatology.

Figure 1 – How a generative adversarial network (GAN) works. 

A Quick GAN Lesson

Essentially, GANs consist of two neural network agents/models (called generator and discriminator) that compete with one another in a zero-sum game, where one agent’s gain is another agent’s loss. The generator is used to generate new plausible examples from the problem domain whereas the discriminator is used to classify examples as real (from the domain) or fake (generated). The discriminator is then updated to get better at discriminating real and fake samples in subsequent iterations, and the generator is updated based on how well the generated samples fooled the discriminator (Figure 1).
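The competition between the two networks can be expressed through their loss functions. The sketch below is plain Python for illustration only (the example later in this post uses MATLAB). It assumes the discriminator outputs the probability that a sample is real, and it uses the non-saturating form of the generator loss that is common in practice; both are assumptions here, not details taken from the post's code. The probability values are made up.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples should score near 1, generated samples near 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating generator loss: low when generated samples are scored as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs (probability that each sample is real).
d_real = [0.9, 0.8]   # scores on real images
d_fake = [0.2, 0.1]   # scores on generated images

print(discriminator_loss(d_real, d_fake))  # small: the discriminator is winning
print(generator_loss(d_fake))              # large: the generator is being caught
```

When the discriminator separates real from generated samples well, its loss is low and the generator's loss is high, and the reverse holds when the generator fools it. That opposing movement is the adversarial pressure that the training updates act on.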

Over the years, numerous architectural variations and improvements over the original GAN idea have been proposed in the literature. Most GANs today are at least loosely based on the DCGAN (Deep Convolutional Generative Adversarial Networks) architecture, formalized by Alec Radford, Luke Metz and Soumith Chintala in their 2015 paper.

You’re likely to see DCGAN, LAPGAN, and PGAN used for unsupervised techniques like image synthesis, and CycleGAN and Pix2Pix used for cross-modality image-to-image translation.

GANs for Medical Images

The use of GANs to create synthetic medical images is motivated by the following aspects:

  1. Medical (imaging) datasets are heavily unbalanced, i.e., they contain many more images of healthy patients than of patients with any given pathology. The ability to create synthetic images (in different modalities) of specific pathologies could help alleviate the problem and provide more and better samples for a deep learning model to learn from.
  2. Manual annotation of medical images is a costly process (compared to similar tasks for generic everyday images, which could be handled using crowdsourcing or smart image labeling tools). If a GAN-based solution were reliable enough to produce appropriate images requiring minimal labeling/annotation/validation by a medical expert, the time and cost savings would be appealing.
  3. Because the images are synthetically generated, there are no patient data or privacy concerns.

Some of the main challenges for using GANs to create synthetic medical images, however, are:

  1. Domain experts would still be needed to assess quality of synthetic images while the model is being refined, adding significant time to the process before a reliable synthetic medical image generator can be deployed.
  2. Since we are ultimately dealing with patient health, the stakes involved in training (or fine-tuning) predictive models using synthetic images are higher than using similar techniques for non-critical AI applications. Essentially, if models learn from data, we must trust the data that these models are trained on.

The popularity of using GANs for medical applications has been growing at a fast pace in the past few years. In addition to synthetic image generation in a variety of medical domains, specialties, and image modalities, other applications of GANs such as cross-modality image-to-image translation (usually among MRI, PET, CT, and MRA) are also being researched in prominent labs, universities, and research centers worldwide.

In the field of dermatology, unsupervised synthetic image generation methods have been used to create high resolution synthetic skin lesion samples, which have also been successfully used in the training of skin lesion classifiers. State-of-the-art (SOTA) algorithms have been able to synthesize high resolution images of skin lesions which expert dermatologists could not reliably tell apart from real samples. Figure 2 shows examples of synthetic images generated by a recently published solution as well as real images from the training dataset.

Figure 2 – (L) synthetically generated images using state-of-the-art techniques;
(R) actual skin lesion images from a typical training dataset.

An example

Here is an example of how to use MATLAB to generate synthetic images of skin lesions.

The training dataset consists of annotated images from the ISIC 2016 challenge, Task 3 (Lesion classification) data set, containing 900 dermoscopic lesion images in JPEG format.

The code is based on an example using a more generic dataset, and then customized for medical images. It highlights MATLAB’s recently added capabilities for handling more complex deep learning tasks, including the ability to:

  • Create deep neural networks with custom layers, in addition to commonly used built-in layers.
  • Train deep neural networks with a custom training loop and automatic differentiation.
  • Process and manage mini-batches of images using custom mini-batch processing functions.
  • Evaluate the model gradients for each mini-batch – and update the generator and discriminator parameters accordingly.

The code walks through creating synthetic images using GANs from start (loading and augmenting the dataset) to finish (training the model and generating new images).

One of the nicest features of using MATLAB to create synthetic images is the ability to visualize the generated images and score plots as the networks are trained (and, at the end of training, rewind and watch the entire process in a “movie player” type of interface embedded into the Live Script). Figure 3 shows a screenshot of the process after 600 epochs / 4200 iterations. The total training time for a 2021 M1 Mac mini with 16 GB of RAM and no GPU was close to 10 hours.

Figure 3 – Snapshot of the GAN after training for 600 epochs / 4200 iterations. On the left: 25 randomly selected generated images; on the right: generator (blue) and discriminator (red) curves showing the score (between 0 and 1, where 0.5 is best) for each iteration.

Figure 4 shows additional examples of 25 randomly selected synthetically generated images after training has completed. The resulting images resemble skin lesions but are not realistic enough to fool a layperson, much less a dermatologist. They indicate that the solution works (notice how the images are very diverse in nature, capturing the diversity of the training set used by the discriminator), but they display several imperfections, among them: a noisy periodic pattern (in what appears to be an 8×8 grid of blocks across the image) and other visible artifacts. It is worth mentioning that the network has also learned a few meaningful artifacts (such as colorful stickers) that are actually present in a significant number of images from the training set.

Figure 4 – Examples of synthetically generated images.

Practical hints and tips

If you choose to go down the path of improving, expanding, and adapting the example to your needs, keep in mind that:

  1. Image synthesis using GANs is a very time-consuming process (as are most deep learning solutions). Be sure to secure as many computational resources as you can.
  2. Some things can go wrong and could be detected by inspecting the training progress, among them: convergence failure (when the generator and discriminator do not reach a balance during training, with one of them overpowering the other) and mode collapse (when the GAN produces a small variety of images with many duplicates and little diversity in the output). Our example doesn’t suffer from either problem.
  3. Your results may not look “great” (contrast Figure 4 with Figure 2), but that is to be expected. After all, in this example we are basically using the standard DCGAN (deep convolutional generative adversarial network). Specialized work in synthetic skin lesion image generation has moved significantly beyond DCGAN; SOTA solutions (such as the one by Bissoto et al. and the one by Baur et al.) use more sophisticated architectures, normalization options, and validation strategies.
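The second hint, inspecting training progress, can be partly automated. The sketch below is plain Python for illustration and is not taken from the MATLAB example. It assumes one plausible definition of the generator and discriminator scores that matches the description in Figure 3 (values between 0 and 1, with 0.5 as the balance point), and the `looks_unbalanced` helper, its window, and its tolerance are hypothetical choices.

```python
def mean(values):
    return sum(values) / len(values)


def gan_scores(prob_real, prob_generated):
    """Scores from the discriminator's 'is real' probabilities on one mini-batch.

    Assumed definitions: the generator score is how real its images look to the
    discriminator; the discriminator score is its average accuracy on both kinds.
    """
    generator_score = mean(prob_generated)
    discriminator_score = (mean(prob_real) + mean([1 - p for p in prob_generated])) / 2
    return generator_score, discriminator_score


def looks_unbalanced(score_history, window=3, tolerance=0.3):
    """Flag possible convergence failure: recent scores all far from the 0.5 balance point."""
    recent = score_history[-window:]
    return len(recent) == window and all(abs(s - 0.5) > tolerance for s in recent)


# Hypothetical discriminator outputs for one mini-batch.
g_score, d_score = gan_scores(prob_real=[0.9, 0.7], prob_generated=[0.2, 0.4])
print(g_score, d_score)

# Hypothetical discriminator score history drifting toward 1 (discriminator overpowering).
print(looks_unbalanced([0.5, 0.95, 0.97, 0.99]))  # True
print(looks_unbalanced([0.5, 0.55, 0.45]))        # False
```

A check like this only catches one network overpowering the other. Mode collapse needs a different signal, such as visually inspecting the generated images for near-duplicates.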

Key takeaways

GANs (and their numerous variations) are here to stay. They are, according to Yann LeCun, “the coolest thing since sliced bread.” Many different GAN architectures have been successfully used for generating realistic (i.e., semantically meaningful) synthetic images, which may help train deep learning models in cases where real images are rare, difficult to find, and expensive to annotate.

In this blog post we have used MATLAB to show how to generate synthetic images of skin lesions using a simple DCGAN and training images from the ISIC archive.

Medical image synthesis is a very active research area, and new examples of successful applications of GANs in different medical domains, specialties, and image modalities are likely to emerge in the near future.  If you’re interested in learning more about it, check out this review paper and use our example as a starting point for further experimentation.

No Code AI, No Kidding Aye – Part II

Challenges addressed by No Code AI platforms

Building an AI model is challenging on three fundamental counts:

  1. Availability of relevant data in good quantity and quality: The less I rant about it, the better.
  2. Need for multiple skills: Building an effective and monetizable AI model is not just the realm of a data scientist alone. It needs data engineering skills and domain knowledge also.
  3. The constant evolution of the ecosystem in terms of new techniques, approaches, methodologies, and tools

There is no easy way out to address the first challenge, at least not so far. So, let us brush that under the carpet for now.

The need for multiple people with complementary skills is an area where a no-code AI platform can add tremendous value. The average data scientist spends half of his or her time preparing and cleaning the data needed to build models and the other half fine-tuning the model for optimum performance. No Code AI platforms (such as Subex HyperSense) can step in with automated data engineering and ML programming accelerators that go a long way toward alleviating the need for a multi-skilled team. What’s more, they empower even Citizen Data Scientists to build competent AI models without knowing any programming language or having any background in data engineering. Platforms like HyperSense provide advanced automated data exploration, data preparation, and multi-source data integration capabilities through simple drag-and-drop interfaces. They combine this with a rich visual representation of the results at every step of the process, so that one does not have to wait until the end to discover an error made in an early step and then go back and make changes everywhere.

As I briefly touched upon a while back, getting the data ready is one-half of the battle won. The plethora of options on the other half is still perplexing: Is it a bird? Is it a plane? Oh no, it is Superman! Well, in our context, it would be more like: Is it DBSCAN? Is it a Gaussian Mixture? Oh no, it is K-Means! Feature engineering and experimenting with different algorithms to get the best results is a specialized skill. It requires an in-depth understanding of the data set, domain knowledge, and the principles of how various algorithms work. Here again, No Code AI platforms like HyperSense come to the table with significant value adds. With capabilities like autonomous feature engineering and multi-algorithm trial and benchmarking, I daresay that they make building models almost child’s play. Please do not get me wrong. I am not for a moment suggesting that these platforms will result in the extinction of the technical data scientist role. On the contrary, they will make data scientists more efficient and give them superpowers to solve greater problems in less time, while managing and guiding teams of citizen data scientists who solve the more mundane, yet existentially important, problem statements.

So far, so good. Having brushed one challenge under the carpet and discussed the other, there is one more: the constant evolution of AI techniques, methodologies, tools, and technologies. Today, just being able to build a model that performs well on a pre-defined set of metrics does not cut it anymore. It is not enough for a model to be simply accurate. As the AI landscape evolves, the chorus for explainability and accountability in models is reaching a fever pitch. Why did K-Means give you a better result than Gaussian Mixture? Will you get the same result if a feature is modified or a new one added? Why did the model predict a similar outcome for most customers belonging to a certain ethnicity? Is the model replicating the bias and vagaries present in the historical data set or in the person building the model? If there have been policies and practices in a business where any sort of decision bias crept into day-to-day functioning, it is only natural that the data sets you work on will carry those biases, and the model you build will continue to persuade you to make decisions with the same biases as before. As an organization that is striving to disrupt and transform your industry, it is imperative that you identify and weed out such biases sooner rather than later, before your AI models hit scale and become a wild animal out of its cage.
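One simple way to start asking the ethnicity question above is to compare how often the model predicts a positive outcome for each group. The sketch below is plain Python with made-up predictions and group labels. It is a first diagnostic, not a full fairness audit and not a feature of any particular platform, and the function names are hypothetical.

```python
from collections import defaultdict


def positive_rate_by_group(predictions, groups):
    """Share of positive (1) predictions within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for prediction, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(prediction == 1)
    return {group: positives[group] / totals[group] for group in totals}


def disparity_ratio(rates):
    """Lowest group rate divided by the highest; 1.0 means all groups are treated alike."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # no group receives positive predictions, so rates are equal
    return min(rates.values()) / highest


# Hypothetical model outputs (1 = approved) and the group of each customer.
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = positive_rate_by_group(predictions, groups)
ratio = disparity_ratio(rates)

print(rates)  # {'A': 0.75, 'B': 0.25}
print(ratio)  # about 0.33
```

A ratio far below 1.0, as in this example, does not prove the model is unfair, but it is a signal to examine whether historical bias in the training data is being reproduced before the model is deployed at scale.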

As No Code AI platforms evolve, model explainability is already being addressed. Platforms like HyperSense give you the option to open up the proverbial ‘black-box’ and peep inside to see why a model behaved the way it did. This gives the analyst or the data scientist an opportunity to tinker with advanced settings and fine-tune them to meet the objectives. Model accountability and ethics are a whole different ball game. They are not restricted to technology but extend to the frailties of human beings as a species. I am sure the evolving AI ecosystem will eventually figure out a way to make the world free of human biases – but hey, where’s the fun then? Human biases do make the world interesting and despicable in equal measure and I believe the holy grail for AI will be to strike a balance between the two.

Until then, let us empower more and more creative and business stakeholders to explore and unleash the true power of AI using No Code platforms like HyperSense so that the world can be a better place for all life forms.

DSC Weekly Digest 10 August 2021

The most baleful aspects of the pandemic seem to be behind us, though the emergence of the Delta variant of the COVID-19 virus is causing companies to question whether it is perhaps too early to shift operations completely back to the office. As months turn into years, a hybrid work model emerging as the dominant approach to work is becoming more and more likely.

This has a major impact upon the shape of work, especially for knowledge workers, including data scientists, programmers, designers, and others who work primarily with information systems, as well as those who manage them. As machine learning systems become more integrated into day-to-day activities, other areas are also being transformed, including education in all its varied manifestations, entertainment, supply chain management, security, manufacturing, and even criminal activity.

As this process plays out, it is forcing a re-evaluation of nearly all aspects of work, including what productivity means in the AI era and whether or not such digital transformations (including Work From Home / Work from Anywhere) are beneficial or harmful to the economy. New DSC Columnist Michael Spencer, editor-in-chief of The Last Futurist, explores this theme in detail in this newsletter, asking whether the digital transformations that we’re seeing will come at the cost of local economies disappearing, especially in the entertainment and service sectors.

The entertainment sector is transforming in ways that would have been unthinkable ten years ago. Salesforce this week announced that they were launching their own business-oriented streaming service, even as companies such as GameStop and AMC are on death watch on Wall Street. We are continuing the process of transforming atoms to bits and then making these virtualized atoms transmissible through ever-faster networks. Scarlett Johansson took Disney to court about royalty revenues lost to streaming, which is likely to send shockwaves through the entertainment sector as creators use the opportunity to renegotiate how such creativity is compensated as the traditional movie theater gives way to the virtualization of location. At the same time, Disney’s last major animated project, Raya and the Last Dragon, was produced almost entirely from the homes of the various animators, editors and other creatives, to the extent that we may not be far from every actor having a green screen room in their house.

Even in the service sector, the skills required (and the demands upon workers) are changing. Delivery has become the next sector to face automation, requiring the coordination of thousands of drivers and fulfillment specialists through the use of highly complex networked systems, often managed through the same kind of tracking tools formerly reserved for large-scale software projects. There is a generation of DIY home manufacturers who are becoming adept at managing such supply chain and distribution issues, and that in turn is shaping how (and where) business gets done.

Ultimately, what is happening is that geolocation is ceasing to be as major a factor as it once was, while at the same time I think that we’ll see the pendulum swinging back towards where local business should be. In my town of Issaquah, here in the Pacific Northwest, the local restaurants along Main Street (or Front Street, in this case) are now seeing more and more patrons, as are the barbers and hair salons, and even a bookstore or two after a few decades of them being destroyed by the large chains (a trend I’m seeing in other sectors as well). I think we’ll find a balance again, but it will be a different equilibrium. We still need that third place, neither home nor work but common ground to re-establish community. 

In media res,

Kurt Cagle
Community Editor,
Data Science Central

To subscribe to the DSC Newsletter, go to Data Science Central and become a member today. It’s free! 
