{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## LabML04\n", "\n", "Before building a model and extracting knowledge from data, it is essential to prepare, merge and clean the data.\n", "\n", "Suppose you have two datasets: one with information about movies shown in theatres in Portugal (https://github.com/masterfloss/datamovies/raw/main/moviesPT3.xlsx) and one with movie ratings obtained from IMDb (https://github.com/masterfloss/datamovies/raw/main/movies_ratings.tsv).\n", "\n", "The overall goal is to merge these datasets and build models that allow us to learn about the subject.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some useful links:\n", "\n", "https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html \n", "\n", "https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html\n", "\n", "https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html\n", "\n", "https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html\n", "\n", "https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Import libraries\n", "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Import 'movies_ratings.tsv' and 'moviesPT3.xlsx' into separate dataframes." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Rename \"ID Imdb\" to \"tconst\" in the second dataframe" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Merge the two dataframes and compare the results of an inner join and a left join.
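# A minimal sketch of the inner vs. left comparison, on hypothetical toy data.
# Assumption: toy_ratings, toy_movies and their values are made up for
# illustration; they are not the lab files and only share the tconst key name.
import pandas as pd

toy_ratings = pd.DataFrame({'tconst': ['tt1', 'tt2'], 'averageRating': [7.1, 6.4]})
toy_movies = pd.DataFrame({'tconst': ['tt1', 'tt3'], 'title': ['A', 'B']})

# Inner join: keeps only the keys that appear in both dataframes.
inner = toy_movies.merge(toy_ratings, on='tconst', how='inner')

# Left join: keeps every row of toy_movies; unmatched ratings become NaN.
left = toy_movies.merge(toy_ratings, on='tconst', how='left')

# indicator=True adds a _merge column that records where each row came from.
left_flagged = toy_movies.merge(toy_ratings, on='tconst', how='left', indicator=True)

# Comparing row counts and the _merge column shows what each join keeps or drops.
print(len(inner), len(left))
print(left_flagged['_merge'].value_counts())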
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Calculate the correlation matrix\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Select only the numeric columns and the title\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Remove rows with missing values. What are the possible alternatives?" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Create a regression model\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use an unsupervised algorithm" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }