{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### LabML03\n", "\n", "The file https://github.com/masterfloss/datamovies/raw/main/moviesPT2.xlsx stores the total number of movies exhibited in Portugal. Each movie may have several exhibition years. The column 'Exhibition year' holds the total number of exhibition years multiplied by the number of films. Each movie has one release date.\n", "Reduce the number of dimensions and then use cluster analysis to create groups of movies according to their characteristics.\n", "\n", "Suggested steps:\n", "\n", "1.\tcreate a new dataframe only with numeric columns\n", "2.\tcreate two extra columns: 'revenue by spectator' and 'revenue by session'\n", "3.\tcalculate the cumulative variance explained\n", "4.\tcreate the loading matrix\n", "5.\tname each one of the components\n", "6.\tcreate a dataframe with the PCA scores\n", "7.\tuse the WCSS approach to find the best number of clusters\n", "8.\tcalculate the best number of clusters\n", "9.\twhat is the silhouette score of the obtained clusters?\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Add all the libraries needed\n", "import pandas as pd\n", "from statsmodels.multivariate.pca import PCA\n", "from matplotlib import pyplot as plt\n", "from sklearn.cluster import KMeans\n", "from sklearn.metrics import silhouette_score" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# create a new dataframe only with numeric columns\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# create two extra columns: 'revenue by spectator' and 'revenue by session'\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# calculate the cumulative variance explained\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": 
{}, "outputs": [], "source": [ "# create the loading matrix\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# name each one of the components\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# create a dataframe with the PCA scores\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# use the WCSS approach to find the best number of clusters\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# calculate the best number of clusters\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# what is the silhouette score of the obtained clusters?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }