PyData 101

(PyData Seattle Keynote; July 6 2017)

The PyData ecosystem is vast and powerful, but it can be overwhelming to newcomers. In this talk I outline some of the history of *why* the Python data science space is the way it is, as well as *what* tools and techniques you should focus on to get started on your own problems. Video: https://www.youtube.com/watch?v=DifMYH3iuFw

Jake VanderPlas

July 06, 2017

Transcript

  1. PyData 101 Jake VanderPlas @jakevdp PyData Seattle 2017 Slides: http://speakerdeck.com/jakevdp/pydata-101

    Everything you need to know to get started in data science in Python.
  2. What is Jupyter? What visualization library should I use? Where

    should I start for Machine Learning? Deep Learning? How should I install Python? What is this Cython thing I keep hearing about? Should I use NumPy or Pandas? Why are there so many ways to do X? Conda envs vs. Jupyter kernels… help! Why isn’t [x] just built-in to Python? What is conda? Is pip the same thing? How do I load this CSV? How do I make interactive graphics? Virtualenv or venv or conda envs? Why is matplotlib so… painful!?! My code is slow… how do I make it faster? How can I parallelize computations?
  3. Why is the PyData space the way it is? ~

    What is the best tool for my job?
  4. Python was created in the 1980s as a teaching language,

    and to “bridge the gap between the shell and C” - Guido van Rossum, The Making of Python
  5. “I thought we'd write small Python programs, maybe 10 lines,

    maybe 50, maybe 500 lines — that would be a big one” - Guido van Rossum, The Making of Python
  6. “Scientists... work with a wide variety of systems ranging from

    simulation codes, data analysis packages, databases, visualization tools, and home-grown software - each of which presents the user with a different set of interfaces and file formats. As a result, a scientist may spend a considerable amount of time simply trying to get all of these components to work together in some manner...” - David Beazley, Scientific Computing with Python (ACM vol. 216, 2000) 1990s: The Scripting Era
  7. 1990s: The Scripting Era 2000s: The SciPy Era Motto: “Python

    as Alternative to MatLab” * yes, this is overly simplified . . .
  8. “I had a hodge-podge of work processes. I would have

    Perl scripts that called C++ numerical routines that would dump data files, and I would load them up into MatLab to plot them. After a while I got tired of the MatLab dependency… so I started loading them up in GnuPlot.” -John Hunter creator of Matplotlib SciPy 2012 Keynote 2000s: The SciPy Era
  9. “Prior to Python, I used Perl (for a year) and

    then Matlab and shell scripts & Fortran & C/C++ libraries. When I discovered Python, I really liked the language... But, it was very nascent and lacked a lot of libraries. I felt like I could add value to the world by connecting low-level libraries to high-level usage in Python.” - Travis Oliphant creator of NumPy & SciPy via email, 2015 2000s: The SciPy Era
  10. 2000s: The SciPy Era “I remember looking at my desk,

    and seeing all the books on languages I had. I literally had a stack with books on C, C++, Unix utilities (awk/sed/sh/etc), Perl, IDL manuals, the Mathematica book, Make printouts, etc. I realized I was probably spending more time switching between languages than getting anything done.” - Fernando Perez, creator of IPython, via email, 2015
  11. Key Software Development - 2000s: The SciPy Era: the core projects were

    released circa 2000-2002, building on the early array libraries Numeric (1995) and Numarray (2002).
  12. 2000s: The SciPy Era - Originally, the three projects (shell, computation,

    visualization) each had much wider scope, with Numeric/Numarray handling array manipulation.
  13. 2000s: The SciPy Era - With time, the projects narrowed their focus

    to shell, computation, and visualization, with a unified array library underneath.
  14. 1990s: The Scripting Era 2000s: The SciPy Era 2010s: The

    PyData Era * yes, this is overly simplified . . .
  15. 1990s: The Scripting Era 2000s: The SciPy Era 2010s: The

    PyData Era Motto: “Python as Alternative to R” * yes, this is overly simplified . . .
  16. 2010s: The PyData Era “I had a distinct set of

    requirements that were not well-addressed by any single tool at my disposal: - Data structures with labeled axes . . . - Integrated time series functionality . . . - Arithmetic operations and reductions . . . - Flexible handling of missing data - Merge and other relational operations . . . I wanted to be able to do all these things in one place, preferably in a language well-suited to general purpose software development” - Wes McKinney creator of Pandas (in Python for Data Analysis)
  17. Key Software Development - 2010s: The PyData Era: 2010: Machine Learning;

    2011: Labeled data; 2012: Packaging; 2012: Compute Environment; 2015: Multi-language support.
  18. 1990s: The Scripting Era 2000s: The SciPy Era 2010s: The

    PyData Era Motto: “Python as Alternative to R” Motto: “Python as Alternative to MatLab” Motto: “Python as Alternative to Bash” * yes, this is all overly simplified . . .
  19. People want to use Python because of its intuitiveness, beauty,

    philosophy, and readability. So people build Python packages that incorporate lessons learned in other tools & communities.
  20. We must recognize: Python is not a data science language.

    Python is a general-purpose language, and this is one of its great strengths for data science.
  21. Strength: HUGE space of capability! Weakness: Where do you start

    ?!?!?!? Think of Python as a Swiss-Army-Knife:
  22. Installation Conda is a cross-platform package and dependency manager, focused

    on Python for scientific and data-intensive computing. It comes in two flavors: - Miniconda is a minimal install of the conda command-line tool - Anaconda is Miniconda plus hundreds of common packages. I recommend Miniconda. http://conda.pydata.org/
  23. Installation Anaconda and Miniconda are both available for a wide

    range of operating systems. http://conda.pydata.org/
  24. $ bash ~/Downloads/Miniconda3-latest-MacOSX-x86_64.sh Welcome to Miniconda3 4.3.21 (by Continuum Analytics,

    Inc.) In order to continue the installation process, please review the license agreement. Please, press ENTER to continue >>> Installation Miniconda is a lightweight installation (~25MB) that gives you access to the conda package management tool. It creates a sandboxed Python installation, entirely disconnected from your system Python. http://conda.pydata.org/
  25. $ which conda /Users/jakevdp/anaconda/bin/conda $ which python /Users/jakevdp/anaconda/bin/python $ python

    Python 3.5.1 |Continuum Analytics, Inc.| (default ... Type "help", "copyright", "credits" or "license" ... >>> print("hello world") hello world Installation Both conda and python now point to the executables installed by miniconda. http://conda.pydata.org/
  26. $ conda install numpy scipy pandas matplotlib jupyter Fetching package

    metadata ......... Solving package specifications: . Package plan for installation in environment /Users/jakevdp/anaconda/: The following NEW packages will be INSTALLED: appnope: 0.1.0-py36_0 bleach: 1.5.0-py36_0 cycler: 0.10.0-py36_0 decorator: 4.0.11-py36_0 Installation Installation of new packages can be done seamlessly with conda install http://conda.pydata.org/
  27. $ conda create -n py2.7 python=2.7 numpy=1.13 scipy Fetching package

    metadata ......... Solving package specifications: . Package plan for installation in environment /Users/jakevdp/anaconda/envs/py2.7: The following NEW packages will be INSTALLED: mkl: 2017.0.3-0 numpy: 1.13.0-py27_0 openssl: 1.0.2l-0 pip: 9.0.1-py27_1 Installation New sandboxed environments can be created with specific versions of Python and its packages. Here we create an environment named py2.7 with Python 2.7 http://conda.pydata.org/
  28. $ source activate python2.7 (python2.7) $ which python /Users/jakevdp/anaconda/envs/python2.7/bin/python (python2.7)

    $ python --version Python 2.7.11 :: Continuum Analytics, Inc. Installation By “activating” the environment, we can now use this different Python version with a different set of packages. You can create as many of these environments as you’d like. http://conda.pydata.org/
  29. Installation I tend to use conda envs for just about

    everything, particularly when testing development versions of projects I contribute to. $ conda env list # conda environments: # astropy-dev /Users/jakevdp/anaconda/envs/astropy-dev jupyterlab /Users/jakevdp/anaconda/envs/jupyterlab python2.7 /Users/jakevdp/anaconda/envs/python2.7 python3.3 /Users/jakevdp/anaconda/envs/python3.3 python3.4 /Users/jakevdp/anaconda/envs/python3.4 python3.5 /Users/jakevdp/anaconda/envs/python3.5 python3.6 /Users/jakevdp/anaconda/envs/python3.6 scipy-dev /Users/jakevdp/anaconda/envs/scipy-dev sklearn-dev /Users/jakevdp/anaconda/envs/sklearn-dev vega-dev /Users/jakevdp/anaconda/envs/vega-dev root /Users/jakevdp/anaconda http://conda.pydata.org/
  30. Installation 1. https://jakevdp.github.io/blog/2016/08/25/conda-myths-and-misconceptions/ So… what about pip? In brief: “pip

    installs python packages within any environment; conda installs any package within conda environments” For many more details on the distinctions, see my blog post, Conda: Myths and Misconceptions1
  31. Coding Environment: $ jupyter notebook [I 06:32:22.641 NotebookApp] Serving notebooks

    from local directory: /Users/jakevdp [I 06:32:22.641 NotebookApp] 0 active kernels [I 06:32:22.641 NotebookApp] The IPython Notebook is running at: http://localhost:8888/ [I 06:32:22.642 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). http://jupyter.org/
  36. Coding Environment: http://jupyter.org/ As of this summer, JupyterLab will be

    available, turning the notebook into a full-featured IDE.
  37. Numerical Computation: NumPy provides the ndarray object which is useful

    for storing and manipulating numerical data arrays. import numpy as np x = np.arange(10) print(x) [0 1 2 3 4 5 6 7 8 9] Arithmetic and other operations are performed element-wise on these arrays: print(x * 2 + 1) [ 1 3 5 7 9 11 13 15 17 19] http://www.numpy.org/
  38. Numerical Computation: Also provides essential tools like pseudo-random numbers, linear

    algebra, Fast Fourier Transforms, etc. M = np.random.rand(5, 10) # 5x10 random matrix u, s, v = np.linalg.svd(M) print(s) [ 4.22083 1.091050 0.892570 0.55553 0.392541] x = np.random.randn(100) # 100 std normal values X = np.fft.fft(x) print(X[:4]) # first four entries [ -7.932434 +0.j -16.683935 -3.997685j 3.229016+16.658718j 2.366788-11.863747j] http://www.numpy.org/
  39. Numerical Computation: Key to using NumPy (and general numerical code

    in Python) is vectorization: x = np.random.rand(10000000) %%timeit y = np.empty(x.shape) for i in range(len(x)): y[i] = 2 * x[i] + 1 1 loop, best of 3: 6.4 s per loop If you write Python like C, you’ll have a bad time: http://www.numpy.org/
  40. Numerical Computation: Key to using NumPy (and general numerical code

    in Python) is vectorization: x = np.random.rand(10000000) %%timeit y = 2 * x + 1 10 loops, best of 3: 58.6 ms per loop Use vectorization for readability and speed ~ 100x speedup! http://www.numpy.org/
  41. Numerical Computation: Key to using NumPy (and general numerical code

    in Python) is vectorization: x = np.random.rand(10000000) %%timeit y = 2 * x + 1 10 loops, best of 3: 58.6 ms per loop Use vectorization for readability and speed https://www.youtube.com/watch?v=EEUXKG97YRw https://speakerdeck.com/jakevdp/losing-your-loops-fast-numerical-computing-with-numpy-pycon-2015 ~ 100x speedup! For a more complete intro to vectorization in NumPy, see Losing Your Loops: Fast Numerical Computing with NumPy (my talk at PyCon 2015)
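
    For example - a minimal sketch, not from the slides - np.where vectorizes even element-wise conditionals that would otherwise require an explicit loop:

        import numpy as np

        x = np.random.rand(10000000)

        # loop version (slow): y[i] = x[i] if x[i] > 0.5 else 0.5
        # vectorized version: the condition is evaluated element-wise
        y = np.where(x > 0.5, x, 0.5)
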
  42. Labeled Data: Pandas provides a DataFrame object which is like

    a NumPy array, but has labeled rows and columns: import pandas as pd df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}) print(df) x y 0 1 4 1 2 5 2 3 6 http://pandas.pydata.org
  43. Labeled Data: Like NumPy, arithmetic is element-wise, but you can

    access and augment the data using column names: df['x+2y'] = df['x'] + 2 * df['y'] print(df) x y x+2y 0 1 4 9 1 2 5 12 2 3 6 15 http://pandas.pydata.org
  44. Labeled Data: Pandas excels in reading data from disk in

    a variety of formats. Start here to read virtually any data format! # contents of data.csv name, id peter, 321 paul, 605 mary, 444 name id 0 peter 321 1 paul 605 2 mary 444 df = pd.read_csv('data.csv') print(df) http://pandas.pydata.org
  45. Labeled Data: Pandas also provides fast SQL-like grouping & aggregation:

    id val 0 A 1 1 B 2 2 A 3 3 B 4 df = pd.DataFrame({'id': ['A', 'B', 'A', 'B'], 'val': [1, 2, 3, 4]}) print(df) val id A 4 B 6 grouped = df.groupby('id').sum() print(grouped) http://pandas.pydata.org
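
    As a minimal sketch (not from the slides), groupby also supports several aggregations at once, much like SQL's GROUP BY:

        import pandas as pd

        df = pd.DataFrame({'id': ['A', 'B', 'A', 'B'],
                           'val': [1, 2, 3, 4]})

        # sum, mean, and max of 'val' within each group
        summary = df.groupby('id')['val'].agg(['sum', 'mean', 'max'])
        print(summary)
        #     sum  mean  max
        # id
        # A     4   2.0    3
        # B     6   3.0    4
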
  46. Visualization: Matplotlib was developed as a Pythonic replacement for MatLab;

    thus MatLab users should find it quite familiar: import numpy as np import matplotlib.pyplot as plt x = np.linspace(0, 10, 1000) plt.plot(x, np.sin(x)) plt.plot(x, np.cos(x)) http://www.matplotlib.org/
  47. Visualization Beyond Matplotlib . . . Pandas offers a simplified

    Matplotlib Interface: data = pd.read_csv('iris.csv') data.plot.scatter('petalLength', 'petalWidth') http://pandas.pydata.org
  48. Visualization Beyond Matplotlib . . . Seaborn is a package

    for statistical data visualization: import seaborn; seaborn.pairplot(data, hue='species') http://seaborn.pydata.org/
  49. from plotnine import ggplot, aes, geom_point, stat_smooth, facet_wrap; from plotnine.data import mtcars; (ggplot(mtcars, aes('wt', 'mpg', color='factor(gear)')) + geom_point() + stat_smooth(method='lm') + facet_wrap('~gear'))

    Visualization Beyond Matplotlib . . . plotnine: grammar of graphics in Python http://plotnine.readthedocs.io/
  50. Visualization Beyond Matplotlib . . . Viz in Python is

    a huge and rapidly-developing space: See my PyCon 2017 talk, Python’s Visualization Landscape https://speakerdeck.com/jakevdp/pythons-visualization-landscape-pycon-2017 https://www.youtube.com/watch?v=FytuB8nFHPQ
  51. Numerical Algorithms: SciPy SciPy contains almost too many submodules to demonstrate:

    e.g. scipy.sparse sparse matrix operations scipy.interpolate interpolation routines scipy.integrate numerical integration scipy.spatial spatial metrics & distances scipy.stats statistical functions scipy.optimize minimization & optimization scipy.linalg linear algebra scipy.special special mathematical functions scipy.fftpack Fourier & related transforms Most functionality comes from wrapping Netlib & related Fortran libraries, meaning it is blazing fast. http://www.scipy.org/
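
    As a minimal sketch (not from the slides), two more of the submodules listed above in action:

        import numpy as np
        from scipy import integrate, stats

        # numerical integration: integrate sin(x) from 0 to pi (exact answer: 2)
        value, error = integrate.quad(np.sin, 0, np.pi)

        # statistics: two-sample t-test on synthetic data
        a = np.random.randn(100)
        b = np.random.randn(100) + 0.5
        t_stat, p_value = stats.ttest_ind(a, b)
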
  52. Numerical Algorithms: SciPy import matplotlib.pyplot as plt import numpy as

    np from scipy import special, optimize x = np.linspace(0, 10, 1000) opt = optimize.minimize(special.j1, x0=3) plt.plot(x, special.j1(x)) plt.plot(opt.x, special.j1(opt.x), marker='o', color='red') http://www.scipy.org/
  53. Machine Learning: $ conda install scikit-learn http://scikit-learn.org/ Scikit-learn features a

    well-defined, extensible API for the most popular machine learning algorithms:
  54. http://scikit-learn.org/ x = 10 * np.random.rand(100) y = np.sin(x) +

    0.1 * np.random.randn(100) plt.plot(x, y, '.k') Make some noisy 1D data for which we can fit a model: Machine Learning with scikit-learn
  55. http://scikit-learn.org/ from sklearn.ensemble import RandomForestRegressor model = RandomForestRegressor() model.fit(x[:, np.newaxis],

    y) xfit = np.linspace(-1, 11, 1000) yfit = model.predict(xfit[:, np.newaxis]) plt.plot(x, y, '.k') plt.plot(xfit, yfit) Fit a random forest regression: Machine Learning with scikit-learn
  56. Machine Learning with scikit-learn http://scikit-learn.org/ from sklearn.svm import SVR model

    = SVR() model.fit(x[:, np.newaxis], y) xfit = np.linspace(-1, 11, 1000) yfit = model.predict(xfit[:, np.newaxis]) plt.plot(x, y, '.k') plt.plot(xfit, yfit) Fit a support vector regression:
  57. Machine Learning with scikit-learn http://scikit-learn.org/ from sklearn.svm import SVR model

    = SVR() model.fit(x[:, np.newaxis], y) xfit = np.linspace(-1, 11, 1000) yfit = model.predict(xfit[:, np.newaxis]) plt.plot(x, y, '.k') plt.plot(xfit, yfit) Fit a support vector regression: Scikit-learn’s strength: provides a common API for the most common machine learning methods.
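
    The same estimator API extends to classification and to model evaluation. A minimal sketch, not from the slides (the synthetic dataset and choice of classifier are illustrative; assumes scikit-learn >= 0.18 for sklearn.model_selection):

        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.ensemble import RandomForestClassifier

        # synthetic classification data, split into train and test sets
        X, y = make_classification(n_samples=200, n_features=5, random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        # the same pattern as above: instantiate, fit, then score on held-out data
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(X_train, y_train)
        print(model.score(X_test, y_test))  # mean accuracy
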
  58. Parallel Computation: $ conda install dask http://dask.pydata.org/ Dask is a

    lightweight tool for creating task graphs that can be executed on a variety of backends.
  59. Parallel Computation: http://dask.pydata.org/ import numpy as np a = np.random.randn(1000)

    b = a * 4 b_min = b.min() print(b_min) -13.2982888603 Typical data manipulation with NumPy:
  60. Parallel Computation: http://dask.pydata.org/ import dask.array as da a2 = da.from_array(a,

    chunks=200) b2 = a2 * 4 b2_min = b2.min() print(b2_min) dask.array<amin-aggregate, shape=(), dtype=float64, chunksize=()> Same operation with dask
  61. Parallel Computation: http://dask.pydata.org/ import dask.array as da a2 = da.from_array(a,

    chunks=200) b2 = a2 * 4 b2_min = b2.min() print(b2_min) dask.array<amin-aggregate, shape=(), dtype=float64, chunksize=()> Same operation with dask “Task Graph”
  62. Parallel Computation: http://dask.pydata.org/ import dask.array as da a2 = da.from_array(a,

    chunks=200) b2 = a2 * 4 b2_min = b2.min() print(b2_min) dask.array<amin-aggregate, shape=(), dtype=float64, chunksize=()> Same operation with dask b2_min.compute() -13.298288860312757
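
    The same task graph can be executed on different backends. A minimal sketch (not from the slides), assuming a recent Dask version in which compute() accepts a scheduler keyword:

        import numpy as np
        import dask.array as da

        a = np.random.randn(1000)
        b2_min = (da.from_array(a, chunks=200) * 4).min()

        # execute the same graph on two different local backends
        print(b2_min.compute(scheduler='threads'))     # local thread pool
        print(b2_min.compute(scheduler='processes'))   # local process pool
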
  63. Code Optimization $ conda install numba http://numba.pydata.org/ Numba is a

    bytecode compiler that can convert Python code to fast LLVM code targeting a CPU or GPU.
  64. Code Optimization http://numba.pydata.org/ Numba Simple iterative functions tend to be

    slow in Python: def fib(n): a, b = 0, 1 for i in range(n): a, b = b, a + b return a %timeit fib(10000) # ipython “timeit magic” 100 loops, best of 3: 2.73 ms per loop
  65. Code Optimization http://numba.pydata.org/ Numba import numba @numba.jit def fib(n): a,

    b = 0, 1 for i in range(n): a, b = b, a + b return a %timeit fib(10000) # ipython “timeit magic” 100000 loops, best of 3: 6.06 µs per loop With a simple decorator, code can be ~1000x as fast! ~ 500x speedup!
  66. Code Optimization http://numba.pydata.org/ Numba Numba achieves this by just-in-time (JIT)

    compilation of the Python function to LLVM byte-code. import numba @numba.jit def fib(n): a, b = 0, 1 for i in range(n): a, b = b, a + b return a %timeit fib(10000) # ipython “timeit magic” 100000 loops, best of 3: 6.06 µs per loop With a simple decorator, code can be ~1000x as fast! ~ 500x speedup!
  67. Code Optimization $ conda install cython http://www.cython.org/ Cython is a

    superset of the Python language that can be compiled to fast C code.
  68. Code Optimization http://www.cython.org/ Again, returning to our fib function: #

    python code def fib(n): a, b = 0, 1 for i in range(n): a, b = b, a + b return a 100 loops, best of 3: 2.73 ms per loop %timeit fib(10000)
  69. Code Optimization http://www.cython.org/ Cython compiles the code to C, giving

    marginal speedups without even changing the code: %%cython def fib(n): a, b = 0, 1 for i in range(n): a, b = b, a + b return a 100 loops, best of 3: 2.42 ms per loop %timeit fib(10000) ~ 10% speedup!
  70. Code Optimization http://www.cython.org/ Using cython’s syntactic sugar to specify types

    for the compiler leads to much better performance: %%cython def fib(int n): cdef int a = 0, b = 1 for i in range(n): a, b = b, a + b return a 100000 loops, best of 3: 5.93 µs per loop %timeit fib(10000) ~ 500x speedup!
  71. 1990s: The Scripting Era 2000s: The SciPy Era 2010s: The

    PyData Era “Python as Alternative to R” “Python as Alternative to MatLab” “Python as Alternative to Bash”
  72. 1990s: The Scripting Era 2000s: The SciPy Era 2010s: The

    PyData Era “Python as Alternative to R” “Python as Alternative to MatLab” “Python as Alternative to Bash” 2020s: ???