Shishir Joshi

Full-stack Data Scientist and self-taught Machine Learning enthusiast with experience in Computer Vision (CV), NLP, Deep Learning, and backend development.


Pune IN
[email protected]

Ph. : +91 8554067867  |  +91 9405181761

Skills


Machine Learning

Transformers (BERT, DistilBERT),

LSTMs, CNNs, SVMs, KNN, Decision Trees, ULMFiT,

Transfer Learning, Deep Learning, Computer Vision



APIs / Libraries

OpenCV, TensorFlow, Keras, PyTorch, Fast.ai, Hugging Face,

scikit-learn, Pandas,

PySpark


Development

Python, shell scripting, Java, SQL, PL/SQL, Git, Django, Flask

AWS [ EC2, S3, SQS, RDS, SageMaker ]


Work Experience

Data Scientist,

GlobalFoundries |  Mar 2021 ~ Present

: Predictive Maintenance Model of Semiconductor Etching Tools
  • Created multiple POCs for predictive maintenance of semiconductor manufacturing tools, based on Remaining Useful Life (RUL) estimated from existing tool sensor and yield data, for the Singapore facility.
  • Developed XGBoost Regressor and autoregressive LSTM models for this task in AWS SageMaker.
  • Deployed models on premises in shadow mode for validation on live data.
: Probability of Failure Analysis and Model for Etching Tools
  • Built a Probability of Failure model that uses test wafer particle counts to predict a tool's internal state.
  • The two-part model consists of a wafer particle (defect) count regressor (Linear Regression, Random Forest Regressor, and XGBoost Regressor were used for POCs) and a thresholded CDF of a Negative Binomial distribution.
  • Used Maximum Likelihood Estimation (MLE) to model tool state based on multiple internal sensors, semiconductor recipe information, and control limit thresholds.
  • The MLE model combined with the thresholded CDF framework shows promising results in extending tool uptime and predicting possible failures based on trends in defect measurements.

Data Scientist,

Jash Data Sciences |  Feb 2020 ~ Feb 2021

: Document Similarity Semantic Search
  • Researched, built, and served a document-level semantic similarity search engine for an AI-based hiring tech startup.
  • Compared an LSTM Seq2Seq autoencoder trained on a language modeling task against a DistilBERT model, using a custom evaluation metric based on semantic similarity.
  • Created the complete backend API in Flask to serve the fine-tuned DistilBERT model, with highly optimized document embedding.
  • Implemented Approximate Nearest Neighbor search using a graph-based HNSW index for near-real-time retrieval.
: Insurance Email Classification and Document NER
  • Created an email classification model based on word embeddings and keyword extraction (test set F1 score: 0.85).
  • This model outperformed a ULMFiT model fine-tuned for email classification on the same dataset (ULMFiT test set F1 score: 0.72).
  • Worked on implementing a DistilBERT-based NER pipeline using Hugging Face Transformers.
: Inhovate - Analytics platform focused on the hospitality industry
  • Automated the complete ETL logic in Bash and Python.
  • Worked on the backend API in Django and wrote custom query builders to dynamically compose and execute complex queries beyond the scope of Django's ORM.
  • Created Linux processes and cron jobs for ETL, and set up a web server with load balancing using HAProxy and Gunicorn.

Software Engineer,

Larsen and Toubro Infotech | Sep 2018 ~ Dec 2019
  • Developed and extended the BRAINS core banking platform.
  • Implemented source code management functionality in-house, saving $25k yearly in licensing costs for an outsourced system.
  • Worked on a pilot project to build a defaulter classifier using a Gradient Boosted Decision Tree model in scikit-learn.

Projects (Computer Vision, NLP, etc.)

Click here for colab notebooks

Qualia
  • https://github/qualia
  • Online real-time semantic search using Transformers and HNSW nearest neighbour search.
  • Can be fine-tuned on specific datasets with custom tokenisation requirements.
  • Uses Sentence Transformers as the embedding model and HNSWlib as the approximate nearest neighbour search index, based on cosine similarity.
  • The aim is to make it "online": able to add new documents in parallel with querying.


StackExchange Question tags extraction

  • https://colab/stx
  • Multi-label classification for tag extraction from StackExchange questions.
  • Used transfer learning to fine-tune the ULMFiT language model on the dataset (83% language modeling accuracy).
  • Used beautifulsoup4 (bs4) and regex to clean the text.
  • Created a multi-label classifier using ULMFiT as the embedding layer.
  • Achieved >93% accuracy on tag prediction.

Machine Learning Based Automatic Fruit Grading and Classification

  • Funded by the University of Pune.
  • Trained an Inception V3 CNN model via transfer learning to detect the grade of fruits based on visual quality, skin texture, and pre-defined standards.
  • Used OpenCV for image preprocessing (cropping, segmentation, and feature extraction for a comparison against SVM classification).
  • Achieved ~90% accuracy on the test set.

MobileNet v2.0 Transfer Learning on the Caltech101 Dataset with TF2.0

  • Trained a custom CNN on the 101 categories of the Caltech101 dataset.
  • Prepared a tf.data input pipeline and compared performance with a transfer-trained MobileNet v2.0 model.
  • The custom model achieved ~98% accuracy, while MobileNet achieved ~80%.

Education

Savitribai Phule Pune University, Pune | 2015 ~ 2018

Bachelor of Engineering (B.E.) | Electronics and Telecommunication
Graduated First Class with Distinction from
Pune Institute of Computer Technology (PICT), Pune