Python 3.6 Jupyter Notebook

Project planning for big data

Your completion of the notebook exercises will be graded based on your ability to do the following:

Understand: Do your pseudo-code and comments show evidence that you recall and understand technical concepts?

Notebook objectives:

By the end of this notebook, you will be expected to:

  • Understand project life cycle management in big data projects;
  • Categorize data in social physics;
  • Understand the interplay between social analytics and technology; and
  • Develop an awareness of the challenges in scaling analytical projects, or integrating these within large organizations.

List of exercises:

  • Exercise 1: List advantages of multimodal approaches.
  • Exercise 2: Identify various data categories and the information they reveal about individual behavior.
  • Exercise 3: Identify a social analytics use case based on recent technological innovations.
  • Exercise 4: List problems that may arise in managing large analytical projects, and suggest corrective actions.

Notebook introduction

While you have been introduced to the many technical concepts, tools, and technologies, little attention has been paid to people and processes. These are critical elements to consider when undertaking projects using data. Technological advances mean that tools and data collection methods are becoming more accessible, and that projects that previously required significant investment (in terms of human resources and technology) can be undertaken by a wider range of audiences. Your use cases and reasons for undertaking projects will vary, but you will still need to interact with a variety of stakeholders.

This notebook introduces a big data project methodology that will allow you to set up big data projects in your own social or commercial context. The existing project is used to demonstrate some of the key concepts that you can use as input in setting up your own projects in the future.

In the video content, Professor Pentland referred to specific concepts that can also be applied to your project. Some of these include the following:

  • Creating social context, which involves surrounding people with others who are thinking about trying certain things in order to get them to adopt those behaviors.
  • Playing with ideas does not change behavior.
  • Implementing multiple strategies, which may be necessary in order to deal with the various parties in the social context (early adopters and close networks).

Note:

While the content introduced in this course aims to provide you with insights typically applied in social analytics projects, you can also use these insights to ensure the successful completion of your project. For example, projects where multiple parties work on similar projects with a shared vision typically have a higher likelihood of success than those performed by individuals in isolation. From a systems point of view, this can be attributed to items such as shared tasks (data collected once and used multiple times). However, people often underestimate the impact of social context when executing projects. Keep all of the content introduced in the course in mind while working through the content presented in this notebook, and ensure that you apply these concepts when setting up or interacting with other parties in your future endeavors.

Refer to the abstract of the paper titled “Social fMRI: Investigating and shaping social mechanisms in the real world”, provided below. You are encouraged to work through the detail of the paper over the next two weeks.

We introduce the Friends and Family study, a longitudinal living laboratory in a residential community. In this study, we employ a ubiquitous computing approach, Social Functional Mechanism-design and Relationship Imaging, or Social fMRI, that combines extremely rich data collection with the ability to conduct targeted experimental interventions with study populations. We present our mobile-phone-based social and behavioral sensing system, deployed in the wild for over 15 months. Finally, we present three investigations performed during the study, looking into the connection between individuals’ social behavior and their financial status, network effects in decision making, and a novel intervention aimed at increasing physical activity in the subject population. Results demonstrate the value of social factors for choice, motivation, and adherence, and enable quantifying the contribution of different incentive mechanisms.

(Aharony et al. 2011)

Note:

This week’s notebooks will touch on various aspects of big data projects in general, and provide technical considerations to keep in mind. The Module 7 notebooks will focus on the abovementioned study by Aharony et al., and its results.

Note:

It is strongly recommended that you save and checkpoint after applying significant changes or completing exercises. This allows you to return the notebook to a previous state should you wish to do so. On the Jupyter menu, select "File", then "Save and Checkpoint" from the dropdown menu that appears.

1. Project overview

Taking a high-level "People, Process, Technology, Data" view of projects can be useful when planning or communicating projects at a conceptual level. Most project management methodologies will contain similar elements to those depicted in the following image (SAS Institute 2012).

1.1 People

Reuse existing skills, and budget time and training for your human resources. Many aspects of big data projects involve learning curves and require additional training. The analyses performed are iterative by nature, and many of the steps involved are extremely difficult to budget for using traditional project management methodologies. Ensure that you have the relevant skills available, and that you deal with the introduction of new tools and methodologies appropriately. Make sure that the analysts work in an environment where other members in the organization can support them, or that they work with tools that have a rich community of support (social context).

Typical roles include the following:

  • Stakeholders and decision makers: Within an organization, this includes chief information officers (CIOs), chief data officers (CDOs), governance and compliance officers, and business unit representatives. Within academia, this includes lecturers and study leads or department representatives of the institution.
  • Analytics professionals: This includes data scientists who create new functions and perform ad hoc analyses (typically using scripting languages such as R and Python, although they may also use other existing tools where appropriate), and data scientists and analysts who repeat and refine analyses. The latter group does similar work, but typically also uses more accessible tools such as Structured Query Language (SQL) and business intelligence (BI) tools. Business analysts and knowledge workers are supported by data analysts and data scientists, and typically use apps and end-to-end tools.
  • Infrastructure specialists: This includes architects, database administrators, and system owners.

1.2 Process

The “process” component of projects refers to the following aspects:

  1. Defining the processes and implementing standards to ensure that your activities are repeatable.
  2. Adhering to the standards (academic or commercial) required by your organization.
  3. Meeting your analytics needs.

Academics have a well-defined and rigorous process in place to deal with many of the issues that may arise. In the business and commercial world, many organizations are moving from waterfall-based approaches to more agile, "fail-fast", or "data-driven" approaches.

1.3 Technology

The “technology” component of projects is characterized by the need to:

  • Choose the analytics platform components as required;
  • Reuse existing components, and be critical of their limitations;
  • Consider open ecosystems and environments that can be used to speed up progress; and
  • Perform due diligence prior to implementing in production, as the checks and balances exist for a reason.

Be careful of moving to this step too quickly. Resolve the function and architecture before making technology choices.

1.4 Data

The “data” component of projects involves carrying out the following three steps:

  1. Identify, acquire, store, and provide access to the data required to support your analytics needs.
  2. Apply the appropriate governance and privacy protection standards.
  3. Use the data to uncover insights, trends, and patterns.

2. Change management

Approaching projects purely from a technical or analytical point of view often delivers underwhelming results. Once you reach an insight, you will also have to act on it, and that action needs to be performed in the social context of your organization or environment. This may take the form of a simple change to an existing business process. In many cases, however, it may be necessary to change existing processes significantly, or to introduce new processes, products, or services, all of which require some form of change management. Review topics around organizational transformation and “data-driven approaches” for further information and guidance. Many of these approaches advocate iterative or “fail-fast” methods in order to test the concept before embarking on full-scale projects. The approach you follow will be highly dependent on the type and size of the organization, and potentially on the tools and resources at your disposal. The majority of approaches contain similar elements, which will be explored in this notebook, and are listed below:

  • Phase 0: Pilot: Preparation and feasibility checks.
  • Phase 1: Project execution and maintenance.
  • Phase 2: Project termination.

3. Project initiation

In their research paper, Aharony et al. (2011) include the following paragraph to describe their vision:

Imagine the ability to place an imaging chamber around an entire community. Imagine the ability to record and display nearly every facet and dimension of behavior, communication, and social interaction among the members of the said community. Moreover, envision being able to conduct interventions in the community, while measuring their effect — by both automatic sensor tools as well as qualitative assessment of the individual subjects. Now, think about doing this for an entire year, while the members of the community go about their everyday lives.

(Aharony et al. 2011)

You may find inspiration in science, technology, or attempting to solve a business need, or your inspiration may be driven by personal interest. Whatever your reason for choosing to initiate a project, you still need to communicate the ideas to multiple stakeholders and plan your project to ensure success.

Starting with a clearly stated and referenceable vision is a good way to ensure that you can communicate the project’s intention to multiple stakeholders, and get them excited and focused on the topic being studied. When setting the vision, keep the potential stakeholders and decision makers in mind. The vision is usually championed by an individual, but may form part of formal structures such as thesis proposals or workshops with multiple potential stakeholders. In terms of your documentation, abstracts, introductions, and overviews are typically completed last, as these items are frequently revisited and updated throughout the course of the project.

The processes in academia, and specifically postgraduate studies, are set up to ensure that prospective students research their topic of interest, and clearly define their intended area of research. The research methodologies are carefully validated to ensure that all relevant aspects are addressed. Business applications tend to be less rigorous regarding processes, while having an increased focus on achieving monetary results. Work through the details of the referenced paper by Aharony et al. (2011), and think carefully about where each of the described sections may be applicable to your potential use cases.

4. Research and context

Review Section 2 of the referenced paper: Related work and context.

Conducting your own research will enable you to benefit from incorporating the latest research in your analytics efforts. At this stage, many of your questions may have already been addressed to some extent, allowing you to build on the work completed by others in order to accelerate your efforts. However, other questions may not be answered directly.

5. Project framework

You can use project methodologies that are applicable to your industry, or ones that you are familiar with, but it is highly advisable to keep a few topics in mind.

In your research, you will likely encounter subject material relevant to your project, which you can use as input. This may include subject matter or business value frameworks, project plans, typical questions, and plans to answer these questions.

5.1 Value perspective

Focusing on the value perspective from an early point helps to keep you focused. This involves considering the following questions:

  • How do you define success?
  • What is the intended outcome?
  • Will results be monetized?
  • How will you measure social impact?
  • Will you be using incentives (such as gamification)?

5.2 Accelerators

Accelerators may include using programming libraries – such as pandas for data analysis, or bandicoot for the interrogation of mobile phone metadata – but you will also be able to leverage the content from studies and service providers in order to accelerate your efforts.

5.3 Skills review

New technological options, and applying the best tools or processes ("fit for purpose"), may achieve results in dramatically reduced timeframes or at a fraction of the cost typically associated with large-scale studies. However, you should start building a view of the resources (available and required) at an early stage.

5.4 Communicating your proposal

Lastly, you should decide on a strategy to visualize and communicate your project.

Review Section 3 of the referenced paper. In this section, there is a clear example of how to frame your intentions in a manner that can be easily communicated to stakeholders.

6. Methodology

Review Section 4 of the referenced paper.

Exercise 1 Start.

Instructions

List two advantages of a multimodal approach similar to the one discussed in the referenced paper.

Your markdown answer here.


Exercise 1 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

The paper describes the methodology employed in detail. When embarking on your journey, there will be additional items to consider which typically do not make it into the final research paper or the final product. Some of these items (which are outside of the scope of the course) are included here for your consideration.

6.1 Data collection

In the video content, earlier in the course, Arek Stopczynski referred to data collection as being expensive. Carefully review the data sources already available, as well as potential new sources of data. You will need to balance the need for trusted, high-accuracy data or low-grain data with the costs and overheads associated with obtaining the data. In many cases, your pilot project will focus on easier-to-obtain data, and you may need to use proxies or subsets of data to demonstrate the usefulness of the concept being studied. Many of the questions that you will need to answer may potentially be answered using a small subset of data (this concept is known as the Pareto principle). There would typically be a number of use cases for the same set of data. Think of social network data. While many organizations, or units within organizations, are excited about the prospect of using this rich source of data, few are able to obtain value from it.

6.2 Data preparation

Data preparation can be a tedious process, and typically requires more time than is budgeted for. Accelerators can aid in transformation, and data governance can help you to reduce the ongoing effort required. When creating data products or implementing production systems, you will typically want to automate these steps in order to free up the capacity of your data scientists, ensuring that they can spend their time on useful tasks rather than repetitive ones.
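
As a minimal sketch of automating a repetitive preparation step, the function below cleans a batch of raw records so that the same logic can be rerun on every new delivery of data. The field names (`user_id`, `duration_s`) and the cleaning rules are hypothetical and chosen purely for illustration; they are not taken from the referenced paper.

```python
def prepare_records(raw_records):
    """Clean raw records: normalize IDs, coerce durations, drop unusable rows.

    The field names are hypothetical; adapt them to your own data source.
    """
    cleaned = []
    for record in raw_records:
        user_id = str(record.get("user_id", "")).strip().lower()
        if not user_id:
            continue  # A record without an identifier cannot be used.
        try:
            duration = float(record.get("duration_s", ""))
        except (TypeError, ValueError):
            continue  # Skip rows whose duration is not numeric.
        if duration < 0:
            continue  # A negative duration points to a collection error.
        cleaned.append({"user_id": user_id, "duration_s": duration})
    return cleaned


# Invented sample data with typical quality problems.
raw = [
    {"user_id": " A01 ", "duration_s": "12.5"},
    {"user_id": "", "duration_s": "3"},
    {"user_id": "b02", "duration_s": "n/a"},
    {"user_id": "C03", "duration_s": -4},
]
prepared = prepare_records(raw)
print(prepared)
```

Keeping the rules in one function means that they can be reviewed, tested, and scheduled, rather than being reapplied by hand for each analysis.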

6.3 Analytics

In addressing analytics concerns, you will build up a "toolbox" as you practice and implement your analytics capabilities. These may include:

  • Statistical analyses and predictive modeling, including generalized linear models (GLMs), k-nearest neighbor (KNN) classification algorithms, and naive Bayes models, for example to train models for churn prediction;
  • Pattern and distribution matching;
  • Time series analysis;
  • Segmentation;
  • Graph and cluster analysis;
  • Data transformation for advanced analysis or input to existing tools;
  • Text analytics to derive patterns and features;
  • Sentiment analysis;
  • Geospatial analysis;
  • Machine and deep learning techniques; and
  • Visualization of random forests and single decision trees.
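
To make one item from the toolbox above concrete, the following is a minimal sketch of a k-nearest neighbor classifier applied to a toy churn example. The features, labels, and data points are invented for illustration, and in practice you would more likely rely on an established library than on a hand-written implementation.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Return the straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(points, labels, query, k=3):
    """Predict a label by majority vote among the k closest training points."""
    ranked = sorted(zip(points, labels), key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical features: (calls per week, support tickets per month).
points = [(30, 0), (28, 1), (25, 0), (5, 4), (3, 5), (6, 3)]
labels = ["stays", "stays", "stays", "churns", "churns", "churns"]

prediction = knn_predict(points, labels, query=(4, 4), k=3)
print(prediction)
```

Note that the features are used unscaled here. With real data, features measured on very different scales would usually be normalized first so that no single feature dominates the distance.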


Exercise 2 Start.

Instructions

Section 4.2 of the referenced paper contains a number of data sources such as mobile phone call and sensor records, surveys, purchasing behavior, and Facebook data. Identify the major groups into which these data can be categorized, and describe what these types of records would typically tell you about the behavior of the individuals.

Hint: Refer to the Module 1 and 2 content in which the difference between how you want to be perceived and what you commit to is described.

Your markdown answer here.


Exercise 2 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

7. System architecture

Review Section 5 of the referenced paper.

The focus of the provided paper is on the mobile phone platform. It defines the sensors used, and describes the data formats, data movement protocols, and the data storage structure.

The system architecture is a conceptual model that defines and describes the technical components contained in the system structure, as well as the system’s behavior. It is typically used for planning and communication purposes when the components need to be described to interested parties.

Usually, you would want to create a system that delivers fast and scalable results, and the components selected will vary greatly, based on available existing resources as well as new requirements and options. Often, you would start with a conceptual or logical model that describes the functions and flow of information in the system. At a later stage of the process, you would revisit granular details such as the physical implementation and specific components that were used.

Many people get stuck when trying to select the technology first, then extending and revisiting the logical and conceptual models at later stages. This approach carries the risk of diluting the project’s purpose, and getting lost in technicalities rather than focusing on the defined purpose.

Privacy considerations will be addressed in the second notebook of this module. However, at this stage, it is important to note that traditional methods that are applied to anonymize data in relational or file-based systems are usually no longer adequate. This necessitates approaches such as “privacy by design” to ensure that you deal with sensitive and potentially sensitive data appropriately. Consider data collected for internal and external use. Your internal researchers may require access to granular and unmasked data, while other business users or applications may only need to access anonymized data sets.
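
The distinction between internal and external access can be sketched as follows. This is an illustrative example only: the field names and the key are hypothetical, and a keyed hash is a form of pseudonymization rather than full anonymization, so on its own it would not satisfy the stronger privacy requirements referred to above.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would be stored securely, never in code.
SECRET_KEY = b"replace-with-a-securely-stored-key"


def pseudonymize(identifier, key=SECRET_KEY):
    """Return a keyed hash so that records can be linked without exposing the identifier."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def external_view(record):
    """Build the masked record that is shared outside the research team."""
    return {
        "user": pseudonymize(record["phone_number"]),
        "calls_per_day": record["calls_per_day"],
    }


# Granular record as an internal researcher might see it (invented values).
internal_record = {
    "phone_number": "+15550100",
    "name": "Example Person",
    "calls_per_day": 7,
}
masked = external_view(internal_record)
print(masked)
```

Because the same identifier always maps to the same pseudonym, external users can still follow one individual over time without seeing the phone number or name.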

While the technical components of systems architecture fall outside of the scope of this course, the following section briefly outlines a number of items that you should carefully consider when defining your architecture.

7.1 Dealing with data

Consider the following points regarding dealing with data before defining your target system architecture:

  1. Sourced data can be stored as files or in dedicated structures (such as databases): These databases can either be relational or non-relational, and their physical implementation should be based on function and cost considerations. A number of implementation alternatives exist, and you can review two of the most popular structures – logical data warehouses and data lakes.
  2. Interacting with the data can be done via a wide variety of tools: The choice of technology will determine the available options. The data can be accessed via BI tools, Excel, SQL, or application program interfaces (APIs). These concepts are not described in detail here; however, you are encouraged to perform your own research should the topics be of interest to you.
  3. Data governance and data definitions need to be maintained over time, and are typically of use to multiple stakeholders in your organization: While controlling these in the analytics phase often ensures quick time-to-action, it is important to hand these over to stable structures to avoid having to repeat the actions on an ongoing basis. The advantage of this approach is that you will start building rich, "trusted" data structures that you can access for future use cases. In addition to this, your insights can start to form part of the internal reporting structures for your organization (if they exist).
  4. Methodologies applied in traditional BI approaches are usually based on waterfall methodologies: Analytics projects tend to be better suited to agile or "fail-fast" methodologies that promote quicker time-to-value and iterative approaches. Once you have determined that your analysis holds value, you can decide on the appropriate action to implement on a more permanent basis. This is instead of committing to large projects before being able to ascertain or confirm that the project is feasible and can be implemented.
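
As a small illustration of the first two points above, the sketch below stores sourced data in a relational structure and then interacts with it via SQL. It uses an in-memory SQLite database from the Python standard library; the table name, column names, and values are hypothetical.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE calls (user_id TEXT, duration_s REAL)")
connection.executemany(
    "INSERT INTO calls (user_id, duration_s) VALUES (?, ?)",
    [("a01", 60.0), ("a01", 120.0), ("b02", 30.0)],
)

# Aggregate the total call duration per user.
totals = connection.execute(
    "SELECT user_id, SUM(duration_s) FROM calls GROUP BY user_id ORDER BY user_id"
).fetchall()
connection.close()
print(totals)
```

The same query could be issued from a BI tool or exposed through an API; the storage choice determines which of these access paths are available.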


Exercise 3 Start.

Instructions

Provide a short description of a recent technological component or trend that you think is significant. This should be selected based on your personal experience or interests, as many of the topics on which technological components or trends are based fall outside of the scope of this course.

  • State your hypothetical social analytics use case (for context).
  • State your chosen technological component or trend, and provide a short, relevant motivation for selecting this component. The description should contain at least five key points, and can include benefits or potential risks associated with the technology or trend.

Hint: APIs, mobile applications, cloud-based computing, and interactive computing are examples of topics that are not covered in detail in this course. Should you wish to do so, you can also base your answer on technological concepts introduced in this course.

Your markdown answer here.


Exercise 3 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

8. Project checklist

When carrying out the phases of a project plan, it is helpful to make use of a checklist to ensure that each step in the plan is executed, and that each aspect is accounted for. The following sections outline typical phases of a project plan, and highlight some of the items that should be added to your checklists.

8.1 Phase 0

Research the topic under review as well as related topics that can be used to add context. Once this is done, you will need to define your project framework, by carrying out the following steps:

  1. Define the vision and scope of the project.
  2. Review your value perspective, and list value drivers. How will you measure success (monetization, ROI, etc.)?
  3. Define security and access policies.
  4. Identify use cases and stakeholders.
  5. Perform feasibility checks.
  6. Establish high-level conceptual architecture.

Pilot projects typically focus on prioritized use cases. In light of this, it is important to do the following:

  • Describe and prioritize use cases. This is often based on business impact, timing (availability and access to data), complexity, and effort.
  • Source and transform data.
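
One simple way to prioritize use cases on the criteria listed above is to score each one and sort the results. The sketch below assumes a 1 to 5 rating per criterion and an unweighted scoring formula; the use cases, the ratings, and the formula itself are hypothetical and would need to be agreed with your stakeholders.

```python
def priority_score(use_case):
    """Higher impact and data availability raise the score; complexity and effort lower it."""
    return (
        use_case["impact"]
        + use_case["data_availability"]
        - use_case["complexity"]
        - use_case["effort"]
    )


# Invented use cases, each rated from 1 (low) to 5 (high) per criterion.
use_cases = [
    {"name": "Churn early warning", "impact": 5, "data_availability": 4,
     "complexity": 3, "effort": 3},
    {"name": "Activity incentive pilot", "impact": 4, "data_availability": 2,
     "complexity": 4, "effort": 4},
    {"name": "Call volume dashboard", "impact": 2, "data_availability": 5,
     "complexity": 1, "effort": 1},
]

ranked = sorted(use_cases, key=priority_score, reverse=True)
for use_case in ranked:
    print(use_case["name"], priority_score(use_case))
```

A scoring sheet of this kind is mainly a communication aid: it makes the trade-offs visible so that the ranking can be challenged and adjusted.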

Data analysis typically follows the same steps as listed in previous modules with short iterations.

8.2 Phase 1

Create a roadmap or an implementation plan. This can be done by doing the following:

  • Planning the activities and resources required to execute the plan.
  • Completing the phases of the data analysis cycle, which are collection, pre-processing, hygiene, analysis (a combination of analytic techniques), visualization, interpretation, and intervention (which may include operationalized insights or a policy definition).
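
The phases of the data analysis cycle can be sketched as a chain of small functions, one per phase. Everything in this example is a hypothetical stand-in: the step counts, the outlier limit, the activity target, and the intervention rule are invented to show the flow rather than to reproduce any real study.

```python
def collect():
    """Collection: hypothetical daily step counts; None marks a missing reading."""
    return [4200, None, 5100, 250000, 4800]


def preprocess(readings):
    """Pre-processing: drop missing readings."""
    return [r for r in readings if r is not None]


def apply_hygiene(readings, upper_limit=100000):
    """Hygiene: discard implausible values such as sensor glitches."""
    return [r for r in readings if 0 <= r <= upper_limit]


def analyze(readings):
    """Analysis: compute the mean number of steps."""
    return sum(readings) / len(readings)


def visualize(readings):
    """Visualization: one text bar per reading, one '#' per 1,000 steps."""
    return ["#" * (r // 1000) for r in readings]


def interpret(mean_steps, target=5000):
    """Interpretation: compare the mean against an assumed target."""
    return "below target" if mean_steps < target else "on target"


def intervene(finding):
    """Intervention: choose an action based on the interpretation."""
    return "send reminder" if finding == "below target" else "no action"


readings = apply_hygiene(preprocess(collect()))
mean_steps = analyze(readings)
bars = visualize(readings)
finding = interpret(mean_steps)
action = intervene(finding)
print(mean_steps, bars, finding, action)
```

Separating the phases in this way makes it easier to iterate on a single phase, for example by tightening the hygiene rule, without touching the rest of the cycle.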

8.3 Phase 2

Project termination generally requires dependencies to be removed, and data to be dealt with appropriately. Ensure that you are aware of the legal requirements when disposing of or archiving sensitive data. Architecture documents (describing both systems and interactions between different systems) can be of significant value during this phase.

Note: In the second part of this notebook, which you will review in Module 7, you will revisit the technical sections (Sections 6 to 11), as well as the conclusion of the referenced paper.


Exercise 4 Start.

Instructions

Project management methodologies and best practices can ensure that you get started quickly. They can, however, also introduce significant overhead. These "best practices" can be used to guide you in unknown areas, and help you deal with the complexities of running large projects.

List two problems that you expect to experience when attempting to scale small-scale or ad hoc analyses to large-scale implementations; or when incorporating these analyses within large organizations. Propose simple corrective actions for each.

Note: These concepts are not dealt with in detail in this course. The aim is to look for insights based on the tools and technologies introduced in this course, rather than attempting to test your knowledge of the various frameworks that exist.

Your markdown answer here.


Exercise 4 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

9. Submit your notebook

Please make sure that you:

  • Perform a final "Save and Checkpoint";
  • Download a copy of the notebook in ".ipynb" format to your local machine using "File", "Download as", and "IPython Notebook (.ipynb)"; and
  • Submit a copy of this file to the Online Campus.

10. References

Aharony, Nadav, Wei Pan, Cory Ip, Inas Khayal, and Alex Pentland. 2011. “Social fMRI: Investigating and shaping social mechanisms in the real world.” Pervasive and Mobile Computing 7:643-659.

SAS Institute. 2012. “Analytics Infrastructure: 15 Considerations.” Accessed September 10. http://blogs.sas.com/content/datamanagement/2012/05/09/analytics-infrastructure-15-considerations/.

