Data Science

Video Lectures:
  1. Data Science – Part I – Building Predictive Analytics Capabilities
  2. Data Science – Part II – Working with R & RStudio
  3. Data Science – Part III – EDA & Model Selection
  4. Data Science – Part IV – Regression Analysis and ANOVA Concepts
  5. Data Science – Part V – Decision Trees & Random Forests
  6. Data Science – Part VI – Market Basket and Product Recommendation Engines
  7. Data Science – Part VII – Cluster Analysis

Further Reading:

R for Everyone: Advanced Analytics and Graphics

Using the open source R language, you can build powerful statistical models to answer many of your most challenging questions. R has traditionally been difficult for non-statisticians to learn, and most R books assume far too much knowledge to be of help. R for Everyone, Second Edition, is the solution.

Drawing on his unsurpassed experience teaching new users, professional data scientist Jared P. Lander has written the perfect tutorial for anyone new to statistical programming and modeling. Organized to make learning easy and intuitive, this guide focuses on the 20 percent of R functionality you’ll need to accomplish 80 percent of modern data tasks.

Lander’s self-contained chapters start with the absolute basics, offering extensive hands-on practice and sample code. You’ll download and install R; navigate and use the R environment; master basic program control, data import, manipulation, and visualization; and walk through several essential tests. Then, building on this foundation, you’ll construct several complete models, both linear and nonlinear, and use some data mining techniques. After all this, you’ll make your code reproducible with LaTeX, RMarkdown, and Shiny.

By the time you’re done, you won’t just know how to write R programs; you’ll be ready to tackle the statistical problems you care about most.

Coverage includes

  • Explore R, RStudio, and R packages
  • Use R for math: variable types, vectors, calling functions, and more
  • Exploit data structures, including data.frames, matrices, and lists
  • Read many different types of data
  • Create attractive, intuitive statistical graphics
  • Write user-defined functions
  • Control program flow with if, ifelse, and complex checks
  • Improve program efficiency with group manipulations
  • Combine and reshape multiple datasets
  • Manipulate strings using R’s facilities and regular expressions
  • Create normal, binomial, and Poisson probability distributions
  • Build linear, generalized linear, and nonlinear models
  • Program basic statistics: mean, standard deviation, and t-tests
  • Train machine learning models
  • Assess the quality of models and variable selection
  • Prevent overfitting and perform variable selection, using the Elastic Net and Bayesian methods
  • Analyze univariate and multivariate time series data
  • Group data via K-means and hierarchical clustering
  • Prepare reports, slideshows, and web pages with knitr
  • Display interactive data with RMarkdown and htmlwidgets
  • Implement dashboards with Shiny
  • Build reusable R packages with devtools and Rcpp

 

Paperback: 560 pages

Publisher: Addison-Wesley; 2nd edition (8 Jun. 2017)

Language: English

ISBN-10: 013454692X

ISBN-13: 978-0134546926

Product Dimensions: 17.8 x 2 x 23.1 cm


 

Data Just Right: Introduction to Large-Scale Data & Analytics

Large-scale data analysis is now vitally important to virtually every business. Mobile and social technologies are generating massive datasets; distributed cloud computing offers the resources to store and analyze them; and professionals have radically new technologies at their command, including NoSQL databases. Until now, however, most books on “Big Data” have been little more than business polemics or product catalogs. Data Just Right is different: It’s a completely practical and indispensable guide for every Big Data decision-maker, implementer, and strategist.

Michael Manoochehri, a former Google engineer and data hacker, writes for professionals who need practical solutions that can be implemented with limited resources and time. Drawing on his extensive experience, he helps you focus on building applications, rather than infrastructure, because that’s where you can derive the most value.

Manoochehri shows how to address each of today’s key Big Data use cases in a cost-effective way by combining technologies in hybrid solutions. You’ll find expert approaches to managing massive datasets, visualizing data, building data pipelines and dashboards, choosing tools for statistical analysis, and more. Throughout, the author demonstrates techniques using many of today’s leading data analysis tools, including Hadoop, Hive, Shark, R, Apache Pig, Mahout, and Google BigQuery.

Coverage includes

  • Mastering the four guiding principles of Big Data success—and avoiding common pitfalls
  • Emphasizing collaboration and avoiding problems with siloed data
  • Hosting and sharing multi-terabyte datasets efficiently and economically
  • “Building for infinity” to support rapid growth
  • Developing a NoSQL Web app with Redis to collect crowd-sourced data
  • Running distributed queries over massive datasets with Hadoop, Hive, and Shark
  • Building a data dashboard with Google BigQuery
  • Exploring large datasets with advanced visualization
  • Implementing efficient pipelines for transforming immense amounts of data
  • Automating complex processing with Apache Pig and the Cascading Java library
  • Applying machine learning to classify, recommend, and predict incoming information
  • Using R to perform statistical analysis on massive datasets
  • Building highly efficient analytics workflows with Python and Pandas
  • Establishing sensible purchasing strategies: when to build, buy, or outsource
  • Previewing emerging trends and convergences in scalable data technologies and the evolving role of the Data Scientist

Paperback: 256 pages

Publisher: Addison-Wesley; 1st edition (19 Dec. 2013)

Language: English

ISBN-10: 0321898656

ISBN-13: 978-0321898654


Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale

Demand is soaring for professionals who can solve real data science problems with Hadoop and Spark. Practical Data Science with Hadoop® and Spark is your complete guide to doing just that. Drawing on immense experience with Hadoop and big data, three leading experts bring together everything you need: high-level concepts, deep-dive techniques, real-world use cases, practical applications, and hands-on tutorials.

The authors introduce the essentials of data science and the modern Hadoop ecosystem, explaining how Hadoop and Spark have evolved into an effective platform for solving data science problems at scale. In addition to comprehensive application coverage, the authors also provide useful guidance on the important steps of data ingestion, data munging, and visualization.

Once the groundwork is in place, the authors focus on specific applications, including machine learning, predictive modeling for sentiment analysis, clustering for document analysis, anomaly detection, and natural language processing (NLP).

This guide provides a strong technical foundation for those who want to do practical data science, and also presents business-driven guidance on how to apply Hadoop and Spark to optimize ROI of data science initiatives.

Learn

  • What data science is, how it has evolved, and how to plan a data science career
  • How data volume, variety, and velocity shape data science use cases
  • Hadoop and its ecosystem, including HDFS, MapReduce, YARN, and Spark
  • Data importation with Hive and Spark
  • Data quality, preprocessing, preparation, and modeling
  • Visualization: surfacing insights from huge data sets
  • Machine learning: classification, regression, clustering, and anomaly detection
  • Algorithms and Hadoop tools for predictive modeling
  • Cluster analysis and similarity functions
  • Large-scale anomaly detection
  • NLP: applying data science to human language

 

Paperback: 256 pages

Publisher: Addison-Wesley; 1st edition (12 Dec. 2016)

Language: English

ISBN-10: 0134024141

ISBN-13: 978-0134024141

Product Dimensions: 17.8 x 1.8 x 23.1 cm


 

Expert Hadoop Administration: Managing, Tuning, and Securing Spark, YARN, and HDFS

In Expert Hadoop® Administration, leading Hadoop administrator Sam R. Alapati brings together authoritative knowledge for creating, configuring, securing, managing, and optimizing production Hadoop clusters in any environment. Drawing on his experience with large-scale Hadoop administration, Alapati integrates action-oriented advice with carefully researched explanations of both problems and solutions. He covers an unmatched range of topics and offers an unparalleled collection of realistic examples.

Alapati demystifies complex Hadoop environments, helping you understand exactly what happens behind the scenes when you administer your cluster. You’ll gain unprecedented insight as you walk through building clusters from scratch and configuring high availability, performance, security, encryption, and other key attributes. The high-value administration skills you learn here will be indispensable no matter what Hadoop distribution you use or what Hadoop applications you run.

  • Understand Hadoop’s architecture from an administrator’s standpoint
  • Create simple and fully distributed clusters
  • Run MapReduce and Spark applications in a Hadoop cluster
  • Manage and protect Hadoop data and high availability
  • Work with HDFS commands, file permissions, and storage management
  • Move data, and use YARN to allocate resources and schedule jobs
  • Manage job workflows with Oozie and Hue
  • Secure, monitor, log, and optimize Hadoop
  • Benchmark and troubleshoot Hadoop

 

Paperback: 848 pages

Publisher: Addison-Wesley (6 Dec. 2016)

Language: English

ISBN-10: 0134597192

ISBN-13: 978-0134597195

Product Dimensions: 17.8 x 4.8 x 23.1 cm


 

Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2

Apache Hadoop is helping drive the Big Data revolution. Now, its data processing has been completely overhauled: Apache Hadoop YARN provides resource management at data center scale and easier ways to create distributed applications that process petabytes of data. In Apache Hadoop™ YARN, two Hadoop technical leaders show you how to develop new applications and adapt existing code to take full advantage of these advances.

YARN project founder Arun Murthy and project lead Vinod Kumar Vavilapalli demonstrate how YARN increases scalability and cluster utilization, enables new programming models and services, and opens new options beyond Java and batch processing. They walk you through the entire YARN project lifecycle, from installation through deployment.

You’ll find many examples drawn from the authors’ cutting-edge experience—first as Hadoop’s earliest developers and implementers at Yahoo! and now as Hortonworks developers moving the platform forward and helping customers succeed with it.

Coverage includes

  • YARN’s goals, design, architecture, and components—how it expands the Apache Hadoop ecosystem
  • Exploring YARN on a single node
  • Administering YARN clusters and Capacity Scheduler
  • Running existing MapReduce applications
  • Developing a large-scale clustered YARN application
  • Discovering new open source frameworks that run under YARN

 

Paperback: 336 pages

Publisher: Addison-Wesley Professional; 1st edition (19 Mar. 2014)

Language: English

ISBN-10: 0321934504

ISBN-13: 978-0321934505

Product Dimensions: 17.8 x 2 x 22.6 cm


 

Visual Storytelling with D3: An Introduction to Data Visualization in JavaScript

Data-driven graphics are everywhere these days, from websites and mobile apps to interactive journalism and high-end presentations. Using D3, you can create graphics that are visually stunning and powerfully effective. Visual Storytelling with D3 is a hands-on, full-color tutorial that teaches you to design charts and data visualizations to tell your story quickly and intuitively, and that shows you how to wield the powerful D3 JavaScript library.

Drawing on his extensive experience as a professional graphic artist, writer, and programmer, Ritchie S. King walks you through a complete sample project—from conception through data selection and design. Step by step, you’ll build your skills, mastering increasingly sophisticated graphical forms and techniques. If you know a little HTML and CSS, you have all the technical background you’ll need to master D3.

This tutorial is for web designers creating graphics-driven sites, services, tools, or dashboards; online journalists who want to visualize their content; researchers seeking to communicate their results more intuitively; marketers aiming to deepen their connections with customers; and for any data visualization enthusiast.

Coverage includes

  • Identifying a data-driven story and telling it visually
  • Creating and manipulating beautiful graphical elements with SVG
  • Shaping web pages with D3
  • Structuring data so D3 can easily visualize it
  • Using D3’s data joins to connect your data to the graphical elements on a web page
  • Sizing and scaling charts, and adding axes to them
  • Loading and filtering data from external standalone datasets
  • Animating your charts with D3’s transitions
  • Adding interactivity to visualizations, including a play button that cycles through different views of your data
  • Finding D3 resources and getting involved in the thriving online D3 community

 

Paperback: 288 pages

Publisher: Addison-Wesley; 1st edition (27 Aug. 2014)

Language: English

ISBN-10: 0321933176

ISBN-13: 978-0321933171

Product Dimensions: 18.8 x 1.6 x 23.3 cm
