Linux Plumbers Conference 2018

Timezone: America/Vancouver

Description

November 13-15, 2018, Vancouver, BC

The Linux Plumbers Conference is the premier event for developers working at all levels of the plumbing layer and beyond.  LPC 2018 will be held November 13-15 in Vancouver, BC, Canada.  We are looking forward to seeing you there!

    • 09:00 – 12:30
      Containers MC Junior-Ballroom-AB (Sheraton Vancouver Wall Center)

      The Containers micro-conference at LPC is the opportunity for runtime maintainers, kernel developers and others involved with containers on Linux to talk about what they are up to and agree on the next major changes to kernel and userspace.

      • 10:30
        Break 30m
    • 09:00 – 12:30
      Kernel Summit Track Junior-Ballroom-D (Sheraton Vancouver Wall Center)

    • 09:00 – 12:30
      LPC Main Track Pavillion-Ballroom-AB (Sheraton Vancouver Wall Center)

      • 09:00
        Improving Graphics Interactivity - It's All in the Timing 45m

        Interactive applications, which include everything from real-time
        games through flight simulators to virtual reality environments,
        place strong real-time requirements on the whole computing environment
        to ensure that the correct data are presented to the user at the
        correct time. This requires two things: first, that the time at which
        the information will be displayed is known to the application, so
        that the correct contents can be computed; and second, that the frame
        actually is displayed at that time.

        These two pieces of information are managed inconsistently through the
        graphics stack, making it difficult for applications to provide a
        smooth animation experience to users. And because of the many APIs
        which lie between the application rendering using OpenGL or Vulkan and
        the underlying hardware, a failure to handle timing correctly at
        any point along the chain will result in visible stutter.

        Fixing this requires changes throughout the system: making the
        kernel provide better control and information about the queuing and
        presentation of images; changing composited window systems to
        ensure that images are displayed at the target time and that
        the actual time of presentation is reported back to applications;
        and finally adding to rendering APIs like Vulkan both control
        over image presentation times and feedback about when images ended up
        being visible to the user.

        This presentation will first demonstrate the effects of poor display
        timing support inherent in the current graphics stack, talk about the
        different solutions required at each level of the system and finally
        show the working system.

        Speaker: Keith Packard (Hewlett Packard Enterprise)
      • 09:45
        The end of time, 19 years to go 45m

        Software that uses a 32-bit integer to represent seconds since the Unix epoch of Jan 1 1970 is affected by that variable overflowing on Jan 19 2038, often in a catastrophic way. Aside from most 32-bit binaries that use timestamps, this includes file systems (e.g. ext3 or xfs), file formats (e.g. cpio, utmp, core dumps), network protocols (e.g. nfs) and even hardware (e.g. real-time clocks or SCSI adapters).

        Work has been going on to avoid that overflow in the Linux kernel, with hundreds of patches reworking drivers, file systems and the user space interfaces including over 50 affected system calls.

        With much of this activity getting done during 2018, it's time to give an update on what has been achieved in the kernel, what parts still remain to be solved, and how we will proceed to solve this in user space, and how to use the work in long-living product deployments.
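
        As a minimal illustration of the failure mode (not from the talk): the value that a signed 32-bit time_t holds one second after 03:14:07 UTC on January 19, 2038.

            /* Illustrative only; the cast wraps the way a 32-bit time_t does. */
            #include <inttypes.h>
            #include <stdio.h>

            int main(void)
            {
                int64_t last_ok = INT32_MAX;              /* 2147483647 s after the epoch */
                int32_t wrapped = (int32_t)(last_ok + 1); /* what a 32-bit time_t stores  */

                printf("last representable second: %" PRId64 "\n", last_ok);
                printf("one second later it holds: %" PRId32 " (December 1901)\n", wrapped);
                return 0;
            }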

        Speaker: Mr. Arnd Bergmann (Linaro)
      • 10:30
        Break 30m
      • 11:00
        Mind the gap - between real-time Linux and real-time theory 45m

        It is common to see Linux being used on real-time research projects. However, the assumptions made in papers are very often unrealistic. In contrast, researchers argue that the main metric used on PREEMPT RT, although useful, is an oversimplification of the problem.

        It is a consensus that academic research helps to improve the state of the art of Linux, and vice versa. So how can we reduce the gap between these two communities? Real-time researchers start their papers with a clear definition of the task model. But we do not have a task model for Linux: this is where the gap is.

        This talk presents the effort to establish a task model for PREEMPT_RT Linux. It starts with a description of the operations that influence the timing behavior of tasks, then defines the relationships between those operations. Finally, it discusses the outcomes for Linux, such as new metrics for PREEMPT_RT and a model validator for the kernel (a lockdep-like verifier, but for preemption).

        Speaker: Daniel Bristot de Oliveira (Red Hat, Inc.)
      • 11:45
        SCHED_DEADLINE desiderata and slightly crazy ideas 45m

        The SCHED_DEADLINE scheduling policy is far from done. Even though it has existed in mainline for several years, many features are yet to be implemented; some are already available as immature code, while others only exist as wishes.

        In this talk Juri Lelli and Daniel Bristot De Oliveira will give the audience in-depth details of what’s missing, what’s under development and what might be desirable to have. The intent is to provide as much information as possible to people attending, so that a fruitful discussion might be held later on during hallway and micro conference sessions.

        Examples of what is going to be presented are:

        • Non-root usage
        • CGroup support
        • Re-working RT Throttling to use DL servers
        • Better Priority Inheritance (AKA proxy execution)
        • Schedulability improvements
        • Better support for tracing
        Speakers: Juri Lelli (Red Hat, Inc.), Daniel Bristot de Oliveira (Red Hat, Inc.)
    • 09:00 – 18:00
      Networking Track Junior-Ballroom-C (Sheraton Vancouver Wall Center)

      A two-day Networking Track will be featured at this year’s Linux Plumbers Conference; it will run the first two days of LPC, November 13-14. The track will consist of a series of talks, including a keynote from David S. Miller: “This talk is not about XDP: From Resource Limits to SKB Lists”.

      Official Networking Track website: http://vger.kernel.org/lpc-networking.html

      • 09:00
        Welcome 20m

        Opening welcome, announcements, etc.

      • 09:20
        XDP - Challenges and Future Work 35m

        XDP already offers rich facilities for high performance packet
        processing, and has seen deployment in several production systems.
        However, this does not mean that XDP is a finished system; on the
        contrary, improvements are being added in every release of Linux, and
        rough edges are constantly being filed down. The purpose of this talk is
        to discuss some of these possibilities for future improvements,
        including how to address some of the known limitations of the system. We
        are especially interested in soliciting feedback and ideas from the
        community on the best way forward.
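
        As background for the discussion, here is a minimal sketch of the XDP programming model (illustrative only, not from this talk): a program attached at the driver that drops UDP packets destined to an arbitrary port and passes everything else up the stack.

            /* Minimal illustrative XDP program; assumes no IP options for brevity. */
            #include <linux/bpf.h>
            #include <linux/if_ether.h>
            #include <linux/in.h>
            #include <linux/ip.h>
            #include <linux/udp.h>
            #include <bpf/bpf_helpers.h>
            #include <bpf/bpf_endian.h>

            SEC("xdp")
            int xdp_drop_port(struct xdp_md *ctx)
            {
                void *data     = (void *)(long)ctx->data;
                void *data_end = (void *)(long)ctx->data_end;
                struct ethhdr *eth = data;
                struct iphdr *iph  = (void *)(eth + 1);
                struct udphdr *udp = (void *)(iph + 1);

                /* Every packet access must be bounds-checked to satisfy the verifier. */
                if ((void *)(udp + 1) > data_end)
                    return XDP_PASS;
                if (eth->h_proto != bpf_htons(ETH_P_IP) || iph->protocol != IPPROTO_UDP)
                    return XDP_PASS;

                return udp->dest == bpf_htons(7777) ? XDP_DROP : XDP_PASS;
            }

            char _license[] SEC("license") = "GPL";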

        The issues we are planning to discuss include, but are not limited to:

        • User experience and debugging tools: How do we make it easier for
          people who are not familiar with the kernel or XDP to get to grips
          with the system and be productive when writing XDP programs?

        • Driver support: How do we get to full support for XDP in all drivers?
          Is this even a goal we should be striving for?

        • Performance: At high packet rates, every micro-optimisation counts.
          Things like inlining function calls in drivers are important, but also
          batching to amortise fixed costs such as DMA mapping. What are the
          known bottlenecks, and how do we address them?

        • QoS and rate transitions: How should we do QoS in XDP? In particular,
          rate transitions (where a faster link feeds into a slower) are
          currently hard to deal with from XDP, and would benefit from, e.g.,
          Active Queue Management (AQM). Can we adapt some of the AQM and QoS
          facilities in the regular networking stack to work with XDP? Or should
          we do something different?

        • Accelerating other parts of the stack: Tom Herbert started the
          discussion on accelerating transport protocols with XDP back in 2016.
          How do we make progress on this? Or should we be doing something
          different? Are there other areas where we can extend XDP's processing
          model to provide useful accelerations?

        Speakers: Jesper Dangaard Brouer (Red Hat), Toke Høiland-Jørgensen (Karlstad University)
      • 09:55
        Leveraging Kernel Tables with XDP 35m

        XDP is a framework for running BPF programs in the NIC driver to allow
        decisions about the fate of a received packet at the earliest point in
        the Linux networking stack. For the most part the BPF programs rely on
        maps to drive packet decisions, maps that are managed for example by a
        userspace agent. This architecture has implications on how the system is
        configured, monitored and debugged.

        An alternative approach is to make the kernel networking tables
        accessible by BPF programs. This approach allows the use of standard
        Linux APIs and tools to manage networking configuration and state while
        still achieving the higher performance provided by XDP. An example of
        providing access to kernel tables is the recently added helper to allow
        IPv4 and IPv6 FIB (and nexthop) lookups in XDP programs. Routing suites
        such as FRR manage the FIB tables, and the XDP packet path benefits by
        automatically adapting to the FIB updates in real time. While a huge
        first step, a FIB lookup alone is not sufficient for general networking
        deployments.

        This talk discusses the advantages of making kernel tables available to
        XDP programs to create a programmable packet pipeline, what features
        have been implemented as of October 2018, key missing features, and
        current challenges.
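
        To make this concrete, here is a hedged sketch (not the talk's code) of an XDP program that consults the kernel FIB through the bpf_fib_lookup() helper and redirects on success; TTL handling, IPv6 and IP options are omitted for brevity.

            #include <linux/bpf.h>
            #include <linux/if_ether.h>
            #include <linux/ip.h>
            #include <bpf/bpf_helpers.h>
            #include <bpf/bpf_endian.h>

            #ifndef AF_INET
            #define AF_INET 2   /* avoid dragging libc socket headers into BPF code */
            #endif

            SEC("xdp")
            int xdp_fib_forward(struct xdp_md *ctx)
            {
                void *data     = (void *)(long)ctx->data;
                void *data_end = (void *)(long)ctx->data_end;
                struct ethhdr *eth = data;
                struct iphdr *iph  = (void *)(eth + 1);
                struct bpf_fib_lookup fib = {};

                if ((void *)(iph + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
                    return XDP_PASS;

                fib.family   = AF_INET;
                fib.ipv4_src = iph->saddr;
                fib.ipv4_dst = iph->daddr;
                fib.tot_len  = bpf_ntohs(iph->tot_len);   /* used for the MTU check */
                fib.ifindex  = ctx->ingress_ifindex;

                /* Ask the kernel routing table; on success the helper fills in the
                 * egress ifindex and the next-hop MAC addresses. */
                if (bpf_fib_lookup(ctx, &fib, sizeof(fib), 0) != BPF_FIB_LKUP_RET_SUCCESS)
                    return XDP_PASS;              /* fall back to the normal stack */

                __builtin_memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
                __builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);
                return bpf_redirect(fib.ifindex, 0);
            }

            char _license[] SEC("license") = "GPL";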

        Speaker: David Ahern (Cumulus Networks)
      • 10:30
        Morning Break 30m
      • 11:00
        Building Socket-aware BPF Programs 35m

        Over the past several years, BPF has steadily become more powerful in multiple
        ways: through building more intelligence into the verifier, which allows more
        complex programs to be loaded, and through extension of the API, such as by
        adding new map types and new native BPF function calls. While BPF has its roots
        in applying filters at the socket layer, the ability to introspect the sockets
        relating to traffic being filtered has been limited.

        To build such awareness into a BPF helper, the verifier needs the ability to
        track the safety of the calls, including appropriate reference counting upon
        the underlying socket. This talk walks through extensions to the verifier to
        perform tracking of references in a BPF program. This allows BPF developers to
        extend the UAPI with functions that allocate and release resources within the
        execution lifetime of a BPF program, and the verifier will validate that the
        resources are released exactly once prior to program completion.

        Using this new reference tracking ability in the verifier, we add socket lookup
        and release function calls to the BPF API, allowing BPF programs to safely find
        a socket and build logic upon the presence or attributes of a socket. This can
        be used to load-balance traffic based on the presence of a listening
        application, or to implement stateful firewalling primitives to understand
        whether traffic for this connection has been seen before. With this new
        functionality, BPF programs can integrate more closely with the networking
        stack's understanding of the traffic transiting the kernel.
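
        A minimal sketch of that contract (illustrative; tuple parsing omitted, section and function names are ours): the verifier accepts the program only because the acquired socket reference is released on every path.

            #include <linux/bpf.h>
            #include <linux/pkt_cls.h>
            #include <bpf/bpf_helpers.h>

            SEC("classifier")
            int allow_if_socket_present(struct __sk_buff *skb)
            {
                struct bpf_sock_tuple tuple = {};   /* would be filled from the packet */
                struct bpf_sock *sk;

                sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                                       BPF_F_CURRENT_NETNS, 0);
                if (!sk)
                    return TC_ACT_SHOT;             /* no matching socket: drop */

                /* 'sk' is a tracked reference; omitting this release on any path
                 * makes the program fail verification at load time. */
                bpf_sk_release(sk);
                return TC_ACT_OK;
            }

            char _license[] SEC("license") = "GPL";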

        Speaker: Joe Stringer (Cilium)
      • 11:35
        Experiences Evaluating DC-TCP 35m

        In this talk we describe our experiences in evaluating DC-TCP. Preliminary testing with Netesto uncovered issues with our NIC that affected fairness between flows, as well as bugs in the DC-TCP code path in Linux that resulted in RPC tail latencies of up to 200ms. Once we fixed those issues, we proceeded to test in a 6 rack mini cluster running some of our production applications. This testing demonstrated very large decreases in packet discards (12 to 1000x) at a cost of larger CPU utilization. In addition to describing the issues and fixes, we provide detailed experimental results and explore the causes of the larger CPU utilization as well as discuss partial solutions to this issue.

        Note: We plan to test on a much larger cluster and have those results available before the conference.

        Speakers: Lawrence Brakmo (Facebook), Boris Burkov (Facebook), Greg Leclercq (Facebook), Murat Mugan (Facebook)
      • 12:10
        Scaling Linux Bridge Forwarding Database 35m

        Linux bridge is deployed on hosts, hypervisors, container OSes and, in recent years, on data center switches. It is complete in its feature set, with forwarding, learning, proxy and snooping functions. It can bridge Layer-2 domains between VMs, containers, racks, PODs and between data centers, as seen with Ethernet Virtual Private Networks [1, 2]. With Linux bridge deployments moving up the rack, it is now bridging larger Layer-2 domains, bringing in scale challenges. The bridge forwarding database can scale to thousands of entries on a data center switch with hardware acceleration support.

        In this paper we discuss performance and operational challenges with a large-scale bridge fdb database and solutions to address them. We will discuss solutions such as fdb dst port failover for faster convergence, a faster API for fdb updates from the control plane, and reducing the number of fdb dst ports with lightweight tunnel endpoints for bridging over a tunneling solution (e.g. vxlan).

        Though discussed in the context of the deployment scenarios below, most solutions are generic and can be applied to all bridge use cases:

        • Multi-chassis link aggregation scenarios where Linux bridge is part of the active-active switch redundancy solution
        • Ethernet VPN solutions where Linux bridge forwarding database is extended to reach Layer-2 domains over a network overlay like VxLAN

        [1] https://tools.ietf.org/html/draft-ietf-bess-evpn-overlay-11
        [2] https://www.netdevconf.org/2.2/slides/prabhu-linuxbridge-tutorial.pdf

        Speakers: Roopa Prabhu (Cumulus Networks), Nikolay Aleksandrov (Cumulus Networks)
      • 12:45
        Lunch 1h 15m
      • 14:00
        P4C-XDP: Programming the Linux Kernel Forwarding Plane Using P4 35m

        The eXpress Data Path (XDP) is a new kernel feature, intended to provide
        fast packet processing as close as possible to device hardware. XDP
        builds on top of the extended Berkeley Packet Filter (eBPF) and allows
        users to write a C-like packet processing program, which can be attached
        to the device driver’s receiving queue. When the device observes an
        incoming packet, the user-defined XDP program is triggered to execute on
        the packet payload, making the decision as early as possible before
        handing the packet down the processing pipeline.

        P4 is a domain-specific language describing how packets are processed by
        the data plane of programmable network elements, including network
        interface cards, appliances, and virtual switches. It provides an
        abstraction that allows programmers to express existing and future
        protocol format without coupling it to any data plane specific
        knowledge. The language is explicitly designed to be protocol-agnostic.
        A P4 programmer can write their own protocols and load the P4 program
        into P4-capable network elements.
        As a high-level networking language, P4 supports a diverse set of compiler
        backends and also possesses the capability to express eBPF and XDP programs.

        We present P4C-XDP, a new backend for the P4 compiler. P4C-XDP leverages
        XDP to aim for a high performance software data plane. The backend
        generates an eBPF-compliant C representation from a given P4 program
        which is passed to clang and llvm to produce the bytecode. Using
        conventional eBPF kernel hooks the program can then be loaded into the
        eBPF virtual machine in the device driver. The kernel verifier
        guarantees the safety of the generated code. Any packets
        received/transmitted from/to this device driver now trigger the
        execution of the loaded P4 program.

        The P4C-XDP project is an open source project hosted at
        https://github.com/vmware/p4c-xdp/. We provide proof-of-concept sample
        code under the tests directory, which contains a couple of examples such
        as basic protocol parsing, checksum recalculation, multiple tables
        lookups, and tunnel protocol en-/decapsulation.

        Speakers: Fabian Ruffy (University of British Columbia), Mihai Budiu (VMware), William Tu (VMware)
      • 14:35
        ERSPAN Support for Linux 35m

        Port mirroring is one of the most common network troubleshooting
        techniques. SPAN (Switch Port Analyzer) allows a user to send a copy
        of the monitored traffic to a local or remote device using a sniffer
        or packet analyzer. RSPAN is similar, but carries the mirrored traffic
        on a VLAN. ERSPAN extends the port mirroring capability from Layer 2
        to Layer 3, allowing the mirrored traffic to be encapsulated in an
        extension of the GRE (Generic Routing Encapsulation) protocol and sent
        through an IP network. In addition, ERSPAN carries configurable
        metadata (e.g., session ID, timestamps), so that the packet analyzer
        has a better understanding of the packets.

        ERSPAN for IPv4 was added into Linux kernel in 4.14, and for IPv6 in
        4.16. The implementation includes both transmission and reception and
        is based on the existing ip_gre and ip6_gre kernel modules. As a
        result, Linux today can act as an ERSPAN traffic source sending the
        ERSPAN mirrored traffic to the remote host, or an ERSPAN destination
        which receives and parses the ERSPAN packets generated from Cisco or
        other ERSPAN-capable switches.

        We’ve added both the native tunnel support and metadata-mode tunnel
        support. In this paper, we demonstrate three ways to use the ERSPAN
        protocol. First, for Linux users, using iproute2 to create native
        tunnel net device. Traffic sent to the net device will be
        encapsulated with the protocol header accordingly and traffic matching
        the protocol configuration will be received from the net device.
        Second, for eBPF users, using iproute2 to create metadata-mode ERSPAN
        tunnel. With eBPF TC hook and eBPF tunnel helper functions, users can
        read/write ERSPAN protocol’s fields in finer granularity. Finally,
        for Open vSwitch users, using the netlink interface to create a switch
        and programmatically parse, lookup, and forward the ERSPAN packets
        based on flows installed from the userspace.

        Speakers: William Tu (VMware), Greg Rose (VMware)
      • 15:10
        The Path to DPDK Speeds for AF_XDP 35m

        AF_XDP is a new socket type for raw frames to be introduced in 4.18
        (in linux-next at the time of writing). The current code base offers
        throughput numbers north of 20 Mpps per application core for 64-byte
        packets on our system; however, there are a lot of optimizations that
        could be performed in order to increase this even further. The focus
        of this paper is the performance optimizations we need to make in
        AF_XDP to get it to perform as fast as DPDK.

        We present optimizations that fall into two broad categories: ones that
        are seamless to the application and ones that require additions to
        the uapi. In the first category we examine the following:

        • Loosen the requirement for having an XDP program. If the user does
          not need an XDP program and there is only one AF_XDP socket bound to
          a particular queue, we do not need an XDP program. This should cut
          out quite a number of cycles from the RX path.

        • Wire up busy poll from user space. If the application writer is
          using epoll() and friends, this has the potential benefit of
          removing the coherency communication between the RX (NAPI) core and
          the application core as everything is now done on a single
          core. Should improve performance for a number of use cases. Maybe it
          is worth revisiting the old idea of threaded NAPI in this context
          too.

        • Optimize for high instruction cache usage through batching, as has
          been explored in, for example, Cisco's VPP stack and by Edward Cree in
          his net-next RFC "Handle multiple received packets at each stage".

        In the uapi extensions category we examine the following
        optimizations:

        • Support a new mode for NICs with in-order TX completions. In this
          mode, the completion queue would not be used. Instead the
          application would simply look at the pointer in the TX queue to see
          if a packet has been completed. In this mode, we do not need any
          backpressure between the completion queue and the TX queue and we
          do not need to populate or publish anything in the completion queue
          as it is not used. Should improve the performance of TX for in-order
          NICs significantly.

        • Introduce the "type-writer" model where each chunk can contain
          multiple packets. This is the model that e.g., Chelsio has in its
          NICs. But experiments show that this mode also can provide better
          performance for regular NICs as there are fewer transactions on the
          queues. Requires a new flag to be introduced in the options field of
          the descriptor.

        With these optimizations, we believe we can reach our goal of close to
        40 Mpps of throughput for 64-byte packets in zero-copy mode. Full
        analysis with performance numbers will be presented in the final
        paper.
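
        For readers unfamiliar with the uapi under discussion, a hedged sketch of the AF_XDP control path follows (illustrative; error handling and the ring mmap()/fill/completion logic are omitted, and the fallback defines cover older libc headers).

            #include <linux/if_xdp.h>
            #include <stdlib.h>
            #include <sys/socket.h>
            #include <unistd.h>

            #ifndef AF_XDP
            #define AF_XDP 44               /* from <linux/socket.h> */
            #endif
            #ifndef SOL_XDP
            #define SOL_XDP 283             /* from <linux/socket.h> */
            #endif

            #define NUM_FRAMES 4096
            #define FRAME_SIZE 2048

            int xsk_create(unsigned int ifindex, unsigned int queue_id)
            {
                int fd = socket(AF_XDP, SOCK_RAW, 0);
                void *bufs = aligned_alloc(getpagesize(), NUM_FRAMES * FRAME_SIZE);
                struct xdp_umem_reg umem = {
                    .addr = (unsigned long long)bufs,
                    .len = NUM_FRAMES * FRAME_SIZE,
                    .chunk_size = FRAME_SIZE,
                };
                int ring_size = 2048;
                struct sockaddr_xdp sxdp = {
                    .sxdp_family = AF_XDP,
                    .sxdp_ifindex = ifindex,
                    .sxdp_queue_id = queue_id,
                };

                /* Register the user memory that will hold the packet frames. */
                setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &umem, sizeof(umem));

                /* Size the four single-producer/single-consumer rings. */
                setsockopt(fd, SOL_XDP, XDP_RX_RING, &ring_size, sizeof(ring_size));
                setsockopt(fd, SOL_XDP, XDP_TX_RING, &ring_size, sizeof(ring_size));
                setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &ring_size, sizeof(ring_size));
                setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &ring_size, sizeof(ring_size));

                /* Bind the socket to one hardware queue of one interface. */
                bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
                return fd;
            }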

        Speakers: Björn Töpel (Intel), Magnus Karlsson (Intel)
      • 15:45
        Afternoon Break 20m
      • 16:05
        eBPF / XDP Based Firewall and Packet Filtering 35m

        iptables has been the typical tool for building firewalls on Linux hosts. We have used it at Facebook for setting up host firewalls on our servers across a variety of tiers. In this proposal, we introduce an eBPF / XDP based firewall solution which we use for packet filtering and which has parity with our iptables implementation. We discuss various aspects of this. Following is a brief summary of these aspects, which we will detail further in the paper / presentation.

        • Design and Implementation:

          • We use BPF Tables (maps, LPM tries, and arrays) to match the relevant packet header contents (see the sketch after this list)
          • The heart of the firewall is an eBPF filter which parses a packet and does lookups against all relevant maps, collecting the matching values. A logical rule set is applied to these collected values. This logical set reads similarly to a human-readable, high-level firewall policy. With iptables rules, amid all the verbose matching criteria inlined in every rule, such a policy-level representation is hard to infer.
        • Performance benefits and comparisons with iptables

          • iptables does packet matching linearly against each rule until a match is found. In our proposal, we use BPF Tables (maps) containing keys for all rules, making packet matching highly efficient. We then apply the policy using the collected results, which results in a considerable speedup over iptables.
        • Ease of policy / config updates and maintenance

          • The network administrator owns the firewall while the app developers typically require opening ports for their applications to work. With our approach of using an eBPF filter, we create a logical separation between the filter which enforces the policy and the contents of the associated maps which represent the specific ports and prefixes that need to be filtered. The policy is owned by the network administrator (Example: ports open to the internet, ports open from within specific prefixes, drop everything else). The data (port numbers, prefixes, etc.) can now belong to a separate logical section which presents application developers with a predetermined destination for updating their data (Example: a file containing ports opened to internal subnets, etc). This reduces friction between the two different functions and reduces human errors.
        • Deployment experience:

          • We deploy this solution in our edge infrastructure to implement our firewall policy.
          • We update configuration, reload filters and contents of the various maps containing keys and values for filtering
        • BPF Program array

          • We use the power of the BPF program array to chain different programs like rate limiters, firewalls, load balancers, etc. These are building blocks for creating a rich, high-performance networking solution
        • Proposal for a completely generic firewall solution to migrate existing iptables rules to eBPF / XDP based filtering

          • We present a proposal which can translate existing iptables rules to a better-performing eBPF program, with mostly user space processing and validation.
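
        As referenced above, a minimal sketch of the map-based matching idea (illustrative, not Facebook's code; map and function names are invented): an LPM-trie map of IPv4 source prefixes consulted from an eBPF program instead of walking rules linearly.

            #include <linux/bpf.h>
            #include <bpf/bpf_helpers.h>

            struct ipv4_lpm_key {
                __u32 prefixlen;        /* number of significant bits         */
                __u32 addr;             /* IPv4 address in network byte order */
            };

            struct {
                __uint(type, BPF_MAP_TYPE_LPM_TRIE);
                __uint(map_flags, BPF_F_NO_PREALLOC);
                __uint(max_entries, 65536);
                __type(key, struct ipv4_lpm_key);
                __type(value, __u32);   /* verdict for this prefix, e.g. XDP_DROP */
            } src_prefix_verdicts SEC(".maps");

            /* Called from an XDP/tc program: one longest-prefix-match lookup
             * replaces a linear walk over the whole rule set. */
            static __always_inline __u32 verdict_for_saddr(__u32 saddr)
            {
                struct ipv4_lpm_key key = { .prefixlen = 32, .addr = saddr };
                __u32 *verdict = bpf_map_lookup_elem(&src_prefix_verdicts, &key);

                return verdict ? *verdict : XDP_DROP;
            }
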
        Speakers: Anant Deepak (Facebook), Puneet Mehra (Facebook), Richard Huang (Facebook)
      • 16:40
        XDP Acceleration Using NIC Metadata, Continued 35m

        This talk is a continuation of the initial XDP HW-based hints work presented at NetDev 2.1 in Seoul, South Korea.

        It will start by showcasing new prototypes that allow an XDP program to request required HW-generated metadata hints from a NIC. The talk will show how the hints are generated by the NIC and what the performance characteristics are for various XDP applications. We also want to demonstrate how such metadata can be helpful for applications that use AF_XDP sockets.

        The talk will then discuss upstreaming plans, and look to generate more discussion around implementation details, programming flows, etc., with the larger audience from the community.

        Speakers: P. J. Waskiewicz (Intel), Neerav Parikh (Intel)
      • 17:15
        Linux SCTP is Catching Up and Going Above! 35m

        SCTP is a transport protocol, like TCP and UDP, originating from the SIGTRAN
        IETF Working Group in the early 2000s with the initial objective of
        supporting the transport of PSTN signalling over IP networks. It featured
        multi-homing and multi-stream from the beginning, and since then there
        have been a number of improvements that help it serve other purposes too,
        such as support for Partial Reliability and Stream Scheduling.

        Linux SCTP arrived late and was stuck. It wasn't up to date with the
        released RFCs, it was far behind other systems such as BSD, and it
        also suffered from performance problems. In the past two years, we
        dedicated ourselves to ensuring that these gaps were addressed and
        focused on making many improvements. Now all the features from released
        RFCs are fully supported in Linux, and some from draft RFCs are
        already in progress. In addition, we've seen a clear improvement in
        performance in various scenarios.

        In this talk we will first do a quick review of SCTP basics, including:

        • Background: Why SCTP is used for PSTN Signalling Transport, why other
          applications are using or will use SCTP.
        • Architecture: The general SCTP structures and procedures implemented in
          Linux kernel.
        • VS TCP/UDP: An overview of functions and applicability of SCTP, TCP and
          UDP.

        Then go through the improvements that were made in the past 2 years,
        including:

        • SCTP-related projects in Linux: Other than the kernel part, there are also
          lksctp-tools, sctp-tests, tahi-sctp, etc.
        • Features implemented lately: RFC ones like Stream Scheduling, Message
          Interleaving, Stream Reconfig, Partially Reliable Policy, and many
          CMSGs, SndInfos, Socket APIs.
        • Improvements made recently: Big patchsets like SCTP Offload, Transport
          Hashtable, SCTP Diag and Full SELinux support.
        • VS BSD: We will take a look at the difference between Linux and BSD now
          regarding SCTP. You will be surprised to see that we've gone further
          than other systems.

        We will finish by reviewing a list of what is on our radar as well as next
        steps, like:

        • Ongoing features: SCTP NAT and SCTP CMT, two big and important features, are
          ongoing and already taking shape, and more performance improvements in the
          kernel have also been started.
        • Code refactor: New Congestion Framework will be introduced, which will
          be more flexible for SCTP to extend more Congestion Algorithms.
        • Hardware support: HW CRC checksum and GSO will definitely improve performance;
          this needs new segmentation logic that works on SCTP chunks, for both the
          .segment callback and the hardware.
        • RFC docs improvements: We believe that more extensions and revisions will
          make SCTP more widespread.

        Given its power and complexity, SCTP is destined to face many challenges
        and threats, but we believe that we have made it, and will continue to make it,
        better not only than the implementations on other systems, but also than other
        transport protocols.
        Please join us, Linux SCTP needs your help too!
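
        As a taste of the multi-streaming mentioned above, a hedged sketch (not from the talk) using the lksctp-tools headers; the function name and stream counts are arbitrary.

            #include <netinet/in.h>
            #include <netinet/sctp.h>
            #include <sys/socket.h>
            #include <unistd.h>

            /* Open a one-to-one SCTP socket and request multiple streams, which
             * avoid head-of-line blocking between independent messages. */
            int sctp_socket_with_streams(void)
            {
                struct sctp_initmsg init = { .sinit_num_ostreams = 8,
                                             .sinit_max_instreams = 8 };
                int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);

                if (fd < 0)
                    return -1;
                if (setsockopt(fd, IPPROTO_SCTP, SCTP_INITMSG, &init, sizeof(init)) < 0) {
                    close(fd);
                    return -1;
                }
                return fd;
            }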

        Speakers: Marcelo Ricardo Leitner (Red Hat), Xin Long (Red Hat)
    • 09:00 – 12:30
      Testing & Fuzzing MC Pavillion-Ballroom-C (Sheraton Vancouver Wall Center)

      The Linux Plumbers 2018 Testing and Fuzzing track focuses on advancing the current state of testing of the Linux Kernel.

      Our objective is to gather leading developers of the kernel and its related testing infrastructure and utilities in an attempt to advance the state of the various utilities in use (and possibly unify some of them), and the overall testing infrastructure of the kernel. We are hopeful that we can build on the experience of the participants of this MC to create solid plans for the upcoming year.

      Plans for participants and talks will be similar to last year's (https://blog.linuxplumbersconf.org/2017/ocw/events/LPC2017/tracks/641).

      • 10:30
        Break 30m
    • 09:00 – 12:30
      Thermal MC Pavillion-Ballroom-D (Sheraton Vancouver Wall Center)

      This proposal is to gather hackers interested in improving the thermal subsystem in the Linux kernel and its interaction with hardware- and userspace-based policies. Nowadays, given the nature of the workloads and the wide spectrum of devices that Linux is used in, the people interested in improving the thermal subsystem come from different backgrounds and bring use cases from diverse thermally constrained systems, ranging from embedded devices to systems with high computing power. Despite the heterogeneity of software solutions to control thermals, the thermal subsystem is still at the core of many of them, including policies that rely on hardware-configured thresholds, interactions with firmware-based control loops, and policies that rely on userspace daemons. Therefore, this micro-conference aims at gathering the thermally interested developers of the community to discuss further improvements to the Linux thermal subsystem.

      • 09:00
        Thermal User space (tools/governors/interfaces) 30m
        Speaker: Srinivas Pandruvada (Intel)
      • 09:30
        Scheduler interactions with thermal management 30m

        A discussion around the thermal pressure patches recently posted to LKML

        Speaker: Ms. Thara Gopinath (Linaro)
      • 10:00
        Idle injection 30m

        A discussion around using idle injection as a means to do thermal management

        Speaker: Daniel Lezcano (Linaro)
      • 10:30
        Break 30m
      • 11:00
        Improvements on thermal zone mode 25m
        Speaker: Rui Zhang (Intel)
      • 11:25
        Better support for virtual temperature sensors 30m

        A discussion on creating virtual temperature sensors that act as aggregators for physical sensors, thereby allowing all the framework operations

        Speaker: Eduardo Valentin (Linux)
      • 11:55
        Thermal usecases and how to handle them 35m

        Discussion around some mobile and enterprise usecases that don't fit very well in the current framework and proposals for possible solutions.

        These include virtual sensors, hierarchical thermal zones, support for multiple sensors per thermal zone, and extending governors to tackle temperature ranges.

        Speakers: Amit Kucheria, Eduardo Valentin (Linux)
    • 14:00 – 17:30
      Containers MC Junior-Ballroom-AB (Sheraton Vancouver Wall Center)

      The Containers micro-conference at LPC is the opportunity for runtime maintainers, kernel developers and others involved with containers on Linux to talk about what they are up to and agree on the next major changes to kernel and userspace.

      • 15:30
        Break 30m
    • 14:00 – 17:30
      Kernel Summit Track Junior-Ballroom-D (Sheraton Vancouver Wall Center)

    • 14:00 – 17:30
      LPC Main Track Pavillion-Ballroom-AB (Sheraton Vancouver Wall Center)

      • 14:00
        Android and the kernel: herding billions of penguins, one version at a time 45m

        Historically, kernels that ran on Android devices have typically been 2+ years old compared to mainline (this year's flagship devices are shipping with 4.9 kernels) and because of the challenges associated with updating, most devices in the field are far behind the latest long-term stable (LTS) releases. The Android team has been gradually putting in place the necessary processes and enhancements to permanently bridge this gap. Much of the work on the Android kernel in 2018 focused on improving the ability to update the kernel -- at least to recent LTS levels. This work comprises a significant testing effort to ensure downstream partners that updating to new LTS levels is safe, as well as process work to convince partners that the security benefits of taking LTS patches far outweigh the risk of new bugs. The testing also focuses on ABI consistency (within LTS releases) for interfaces relied upon by userspace and kernel modules. This has resulted in enhancements to the LTP suite and a new proposal to the mailing list for "kernel namespaces".

        Additionally, the Android kernel testing benefits from additional tools developed by Google that are enabled via the Clang compiler. Google's devices have been shipping kernels built via Clang for 2 years. The Android team tests and assists in maintaining arm and arm64 kernel builds with clang.

        The talk will also cover some of the key features being developed for Android and introduce topics that will be discussed during the Android Micro-Conference.

        Speaker: Patil Sandeep (Google)
      • 14:45
        Heterogeneous Memory Management 45m

        Heterogeneous computing uses massively parallel devices, such as GPUs, to crunch through huge data sets. This talk intends to present the issues, challenges and problems related to memory management in heterogeneous computing: issues and problems that arise from having one address space per device, which makes exchanging or sharing data sets between devices and CPUs hard, complex and error prone.

        Solutions involve a unified address space between devices and CPUs, often called SVM (Shared Virtual Memory) or SVA (Shared Virtual Address). In such a unified address space, a virtual address valid on the CPUs is also valid on the devices. The talk will address both hardware and software solutions to this problem. Moreover, it will consider ways to preserve the ability to use the device memory in those schemes.

        Ultimately this talk is an opportunity to discuss memory placement, as with NUMA architectures, in a world where we not only have to worry about CPUs but also about devices like GPUs and their associated memory.

        As if that were not enough, we now also have to worry about the memory hierarchy for each CPU or device, going from fast High Bandwidth Memory (HBM) to main memory (DDR DIMMs), which can be orders of magnitude slower, and finally to persistent memory, which is large in size but slower and with higher latency.

        Speaker: Jerome Glisse (Red Hat)
      • 15:30
        Break 30m
      • 16:00
        Documenting Linux MM for fun and for ... fun 45m

        It is well known that developers do not like writing documentation. But although documenting the code may seem dull and unrewarding, it has definite value for the writer.

        When you write the documentation you gain an insight into the algorithms, design (or lack of such), and implementation details. Sometimes you see neat code and say "Hey, that's genius!". But sometimes you discover small bugs or chunks of code that beg for refactoring. In any case, your understanding of the system significantly improves.

        I'd like to share the experience I had with the Linux memory management documentation: what its state was a few months ago, what has been done, and where we are now.

        The work on the memory management documentation is in progress and the question "Where do we want to be?" is definitely a topic for discussion and debate.

        Speaker: Mike Rapoport (IBM)
      • 16:45
        Towards a Linux Kernel Maintainer Handbook 45m

        The first rule of kernel maintenance is that there are no hard and fast rules. While there are several documents and guidelines on patch contribution, advice on how to serve in a maintainer role has historically been tribal knowledge. This organically grown state of affairs is both a source of strength and a source of friction. It has served the community well to be adaptable to the different personalities and technical problem spaces that inhabit the kernel community. However, that variability also leads to inconsistent experiences for contributors across subsystems, insufficient guidance for new maintainers, and excess stress on current maintainers. As the Linux kernel project expects to continue its rate of growth, it needs to be able both to scale the maintainers it has and to ramp up new ones without necessarily requiring them to make a decade's worth of mistakes to become proficient.

        The presentation makes the case for why a maintainer handbook is needed, including frequently committed mistakes and commonly encountered pain points. It broaches the "whys" and "hows" of contributors having significantly different experiences with the Linux kernel project depending on what subsystem they are contributing to. The talk is an opportunity to solicit maintainers in the audience on the guidelines they would reference on an ongoing basis, and it is an opportunity for contributors to voice wish-list items when working with upstream. Finally, it will be a call to action to expand the document with subsystem-local rules of the road where those local rules differ from, or go beyond, the common recommendations.

        Speaker: Dan Williams (Intel Open Source Technology Center)
    • 14:00 – 17:30
      RT MC Pavillion-Ballroom-C (Sheraton Vancouver Wall Center)

      Since 2004 a project has been going on trying to make the Linux kernel into a true hard real-time operating system. This project has become known as PREEMPT_RT (formerly the "real-time patch"). Over the past decade, there was a running joke that this year PREEMPT_RT would be merged into the mainline kernel, but that has never happened. In actuality, it has been merged in pieces. Examples of what came from PREEMPT_RT include: mutexes, high resolution timers, lockdep, ftrace, RT scheduling, SCHED_DEADLINE, RCU_PREEMPT, generic interrupts, priority inheritance futexes, threaded interrupt handlers and more. The only thing left is turning spin_locks into mutexes, and that is now mature enough to make its way into mainline Linux. This year could possibly be the year PREEMPT_RT is merged!

      Getting PREEMPT_RT into the kernel was a battle, but it is not the end of the war. Once PREEMPT_RT is in mainline, there's still much more work to be done. The RT developers have been so focused on getting RT into mainline, that little has been thought about what to do when it is finally merged. There is a lot to discuss about what to do after RT is in mainline. The game is not over yet.

    • 09:00 – 13:35
      Device Tree MC Pavillion-Ballroom-D (Sheraton Vancouver Wall Center)

      Topics:

      • Binding and Devicetree Source/DTB Validation: update and next steps
        • Binding specification format
        • Validation Process and Process
        • How to validate overlays
      • Devicetree Specification: update and next steps
      • Reducing devicetree memory and storage size
      • Overlays
        • Bootloader and Linux kernel implementation update
        • Remaining blockers and issues
        • Use cases
      • Devicetree compiler (dtc)
        • Next version of DTB/FDT format
          • Motivated by desire to replace metadata being encoded as normal data (metadata for overlays)
          • Other desired changes should be considered
      • Boot and Run-time Configuration
        • Pain points and needs
        • Multi-bus devices
      • Feedback from the trenches
        • how DTOs are used in embedded devices in practice
          • in U-Boot and Linux
          • in systems with FPGAs
      • Use of devicetrees in small code/data space (e.g. U-Boot SPL)
      • Connector node bindings
      • FPGA issues

      • 09:00
        Welcome & Introduction 15m
        Speakers: Frank Rowand, Mr. Rob Herring (Linaro)
      • 09:15
        Q&A Session 1 + General Discussion 15m
        Speakers: Frank Rowand, Mr. Rob Herring (Linaro)
      • 09:30
        Binding Specifications + Base DeviceTree Source Validation 45m
        Speaker: Mr. Rob Herring (Linaro)
      • 10:15
        DT memory (kernel), DT memory (bootloader), & storage (FDT) size 20m
        Speakers: Frank Rowand, Mr. Rob Herring (Linaro), Mr. Simon Glass (Google)
      • 10:35
        Break 25m
      • 11:00
        FPGA + DT 20m
        Speaker: Mr. Moritz Fischer
      • 11:20
        DT Specification update 10m
        Speaker: Mr. Rob Herring (Linaro)
      • 11:30
        New FDT format & Overlays 20m
        Speaker: Frank Rowand
      • 11:50
        Q&A Session #2 20m
        Speakers: Frank Rowand, Mr. Rob Herring (Linaro)
      • 12:10
        Summary, Action Items, and closing 20m
        Speaker: Frank Rowand
    • 09:00 – 12:30
      Kernel Summit Track Junior-Ballroom-D (Sheraton Vancouver Wall Center)

    • 09:00 – 12:30
      LPC Main Track Pavillion-Ballroom-AB (Sheraton Vancouver Wall Center)

      • 09:00
        Exploring New Frontiers in Container Technology 45m

        Containers (or Operating System based Virtualization) are an old
        technology; however, the current excitement (and consequent
        investment) around containers provides interesting avenues for
        research on updating the way we build and manage container technology.
        The most active area of research today, thanks to concerns raised by
        groups supporting other types of virtualization, is in improving the
        security properties of containers.

        The first step in improving security is actually being able to measure
        it in the first place, so the initial goal of a research programme for
        container security involves finding that measure. In this talk I'll
        outline one such measure (attack profiles) developed by IBM research,
        the useful results that can be derived from it, the problems it has
        and the avenues that can be explored to refine future measurements of
        containment.

        Contrary to popular belief, a "container" doesn't describe one fixed
        thing, but instead is a collective noun for a group of isolation and
        resource control primitives (in Linux terminology called namespaces
        and cgroups) the composition of which can be independently varied. In
        the second half of this talk, we'll explore how containment can be
        improved by replacing some of the isolation primitives with local
        system call emulation sandboxes, a promising technique used by both
        the Google gVisor and the IBM Nabla secure container systems. We'll
        also explore the question of whether sandboxes are the end point of
        container security research or merely point the way to the next
        Frontier for container abstraction.

        Speaker: James Bottomley (IBM)
      • 09:45
        Open Source GPU compute stack - Not dancing the CUDA dance 45m

        Using graphics cards for compute acceleration has been a major shift in technology lately, especially around AI/ML and HPC.

        Until now the clear market leader has been the CUDA stack from NVIDIA, which is a closed source solution that runs on Linux. Open source applications like tensorflow (AI/ML) rely on this closed stack to utilise GPUs for acceleration.

        Vendor-aligned stacks such as AMD's ROCm and Intel's OpenCL NEO are emerging that try to fill the gap for their specific hardware platforms. These stacks are very large, and don't share much if any code. There are also efforts being made inside groups like Khronos, with their OpenCL, SPIR-V and SYCL standards, to produce something that can work as a useful standardised alternative.

        This talk will discuss the possibility of creating a vendor-neutral reference compute stack, based around open source technologies and open source development models, that could execute compute tasks across GPUs from multiple vendors, using SYCL/OpenCL/Vulkan and the open-source Mesa stack as the basis on which to develop tools and features as part of a desktop OS.

        This talk doesn't have all the answers, but it wants to get people considering what we can produce in the area.

        Speaker: David Airlie
      • 10:30
        Break 30m
      • 11:00
        Proactive Defense Against CPU Side Channel Attacks 45m

        Side channel attacks are here to stay. What can we do inside the operating system to proactively defend against them? This talk will walk through a few of the ideas that Intel’s Open Source Technology Center are developing to improve our resistance to side channel attacks as part of our new side channel defense project. We would also like to gather ideas from the rest of the community on what our top priorities for side channel defense for the Linux kernel should be.

        Speaker: Kristen Accardi
      • 11:45
        Untrusted Filesystems 45m

        Plugging in USB sticks, building VM images, and unprivileged containers all give rise to a situation where users are mounting and dealing with filesystem images they have not built themselves, and don't necessarily want to trust.

        This leads to the problem of how to mount and read/write those filesystems without opening yourself up to more risk than visiting a web page.

        I will survey what has been built already, describe the technical challenges, and describe the problems ahead.

        With this talk I hope to unite the various groups across the Linux ecosystem that care about this problem and get the discussion started on how we can move forward.

        Speaker: Eric Biederman
    • 09:00 – 18:00
      Networking Track Junior-Ballroom-C (Sheraton Vancouver Wall Center)

      A two-day Networking Track will be featured at this year’s Linux Plumbers Conference; it will run the first two days of LPC, November 13-14. The track will consist of a series of talks, including a keynote from David S. Miller: “This talk is not about XDP: From Resource Limits to SKB Lists”.

      Official Networking Track website: http://vger.kernel.org/lpc-networking.html

      • 09:00
        Daily opening, announcements, etc. 20m
      • 09:55
        Combining kTLS and BPF for Introspection and Policy Enforcement 35m

        This talk is divided into two parts: first, we present kTLS and the current kernel's
        sockmap BPF architecture for L7 policy enforcement, as well as the kernel's ULP and
        strparser frameworks, which are utilized by both in order to hook into socket callbacks
        and determine message boundaries for subsequent processing.

        We further elaborate on the challenges we face when trying to combine kTLS with the
        power of BPF for the eventual goal of allowing in-kernel introspection and policy
        enforcement of application data before encryption. Among other things, this includes a
        discussion on various approaches to address the shortcomings of the current ULP layer,
        optimizations for strparser, and the consolidation of scatter/gather processing for
        kTLS and sockmap as well as future work on top of that.
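
        For context, a hedged sketch (not the talk's code) of the existing kTLS TX setup that this work builds on: after the TLS handshake completes in userspace, the connection's crypto state is handed to the kernel via the "tls" ULP. The fallback defines cover older libc headers.

            #include <linux/tls.h>
            #include <netinet/in.h>
            #include <netinet/tcp.h>
            #include <string.h>
            #include <sys/socket.h>

            #ifndef SOL_TLS
            #define SOL_TLS 282             /* from <linux/socket.h> */
            #endif
            #ifndef TCP_ULP
            #define TCP_ULP 31              /* from <linux/tcp.h> */
            #endif

            int enable_ktls_tx(int sock, const unsigned char *key, const unsigned char *iv,
                               const unsigned char *salt, const unsigned char *rec_seq)
            {
                struct tls12_crypto_info_aes_gcm_128 ci = {};

                /* Attach the "tls" upper layer protocol (ULP) to the TCP socket. */
                if (setsockopt(sock, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
                    return -1;

                ci.info.version = TLS_1_2_VERSION;
                ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
                memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
                memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
                memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
                memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

                /* From here on, data written with plain send() is encrypted by the
                 * kernel (or offloaded to the NIC) on its way out. */
                return setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci));
            }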

        Speakers: Daniel Borkmann (Cilium), John Fastabend (Cilium)
      • 10:30
        Morning Break 30m
      • 11:00
        Optimizing UDP for Content Delivery with GSO, Pacing and Zerocopy 35m

        UDP is a popular foundation for new protocols. It is available across
        operating systems without superuser privileges and widely supported
        by middleboxes. Shipping protocols in userspace on top of
        a robust UDP stack allows for rapid deployment, experimentation
        and innovation of network protocols.

        But implementing protocols in userspace has limitations. The
        environment lacks access to features like high resolution timers
        and hardware offload. Transport cost can be high. Cycle count of
        transferring large payloads with UDP can be up to 3x that of TCP.

        In this talk we present recent and ongoing work, both by the authors
        and others, on improving UDP for content delivery.

        UDP segmentation offload amortizes transmit stack traversal by
        sending as many as 64 segments as one large fused packet.
        The kernel passes this through the stack as one datagram, then
        splits it into multiple packets and replicates their network and
        transport headers just before handing to the network device.

        Some devices can offload segmentation for exact multiples of
        segment size. We discuss how partial GSO support combines the
        best of software and hardware offload and evaluate the benefits of
        segmentation offload over standard UDP.
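
        A hedged sketch of the transmit-side API discussed above (the helper name is ours; error handling trimmed): one sendto() of a large buffer that the stack splits into 1472-byte datagrams.

            #include <netinet/in.h>
            #include <netinet/udp.h>
            #include <sys/socket.h>

            #ifndef UDP_SEGMENT
            #define UDP_SEGMENT 103          /* from <linux/udp.h>, Linux 4.18+ */
            #endif

            int send_gso(int fd, const struct sockaddr_in *dst,
                         const void *payload, size_t len)
            {
                int gso_size = 1472;         /* wire size of each produced segment */

                /* Ask the stack to split every subsequent send into datagrams of
                 * gso_size bytes; IPPROTO_UDP (17) equals SOL_UDP. */
                if (setsockopt(fd, IPPROTO_UDP, UDP_SEGMENT, &gso_size, sizeof(gso_size)) < 0)
                    return -1;

                /* One syscall and one stack traversal for up to 64 segments. */
                return sendto(fd, payload, len, 0,
                              (const struct sockaddr *)dst, sizeof(*dst));
            }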

        With these large buffers, MSG_ZEROCOPY becomes effective at
        removing the cost of copying in sendmsg, often the largest
        single line item in these workloads. We extend this to UDP and
        evaluate it on top of GSO.

        Bursting too many segments at once can cause drops and retransmits.
        SO_TXTIME adds a release time interface which allows offloading of
        pacing to the kernel, where it is both more accurate and cheaper.
        We will look at this interface and how it is supported by queuing
        disciplines and hardware devices.
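
        And a hedged sketch of the SO_TXTIME/SCM_TXTIME interface mentioned here (illustrative; the helper name is ours, and an etf or fq qdisc is assumed on the egress interface):

            #include <linux/net_tstamp.h>   /* struct sock_txtime */
            #include <string.h>
            #include <sys/socket.h>
            #include <sys/uio.h>
            #include <time.h>

            #ifndef SO_TXTIME
            #define SO_TXTIME 61            /* from <asm-generic/socket.h> */
            #define SCM_TXTIME SO_TXTIME
            #endif

            int send_at(int fd, const void *buf, size_t len, __u64 txtime_ns)
            {
                struct sock_txtime cfg = { .clockid = CLOCK_TAI };
                char cbuf[CMSG_SPACE(sizeof(__u64))] = {};
                struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
                struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                                      .msg_control = cbuf,
                                      .msg_controllen = sizeof(cbuf) };
                struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

                /* Opt in once per socket; the qdisc then releases each packet at
                 * the time carried in its SCM_TXTIME control message. */
                if (setsockopt(fd, SOL_SOCKET, SO_TXTIME, &cfg, sizeof(cfg)) < 0)
                    return -1;

                cm->cmsg_level = SOL_SOCKET;
                cm->cmsg_type  = SCM_TXTIME;
                cm->cmsg_len   = CMSG_LEN(sizeof(__u64));
                memcpy(CMSG_DATA(cm), &txtime_ns, sizeof(txtime_ns));

                return sendmsg(fd, &msg, 0);
            }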

        Finally, we look at how these transmit savings can be extended to
        the forwarding and receive paths through the complement of GSO,
        GRO, and local delivery of fused packets.

        Speaker: Willem de Bruijn (Google)
      • 11:35
        Bringing the Power of eBPF to Open vSwitch 45m

        Among the various ways of using eBPF, OVS has been exploring its power
        in three: (1) attaching eBPF to TC, (2) offloading a subset of
        processing to XDP, and (3) bypassing the kernel using AF_XDP.
        Unfortunately, as of today, none of the three approaches satisfies the
        requirements of OVS. In this presentation, we'd like to share the
        challenges we faced and the lessons we learned, and to seek feedback from
        the community on the future direction.

        Attaching eBPF to TC started first, with the most aggressive goal: we
        planned to re-implement the entire feature set of the OVS kernel datapath
        under net/openvswitch/* in eBPF code. We worked around a couple of
        limitations, for example, the lack of TLV support led us to redefine a
        binary kernel-user API using a fixed-length array; and without a
        dedicated way to execute a packet, we created a dedicated device for
        user to kernel packet transmission, with a different BPF program
        attached to handle packet execute logic. Currently, we are working on
        connection tracking. Although a simple eBPF map can achieve basic
        operations of conntrack table lookup and commit, how to handle NAT,
        (de)fragmentation, and ALG are still under discussion.

        One layer below TC sits XDP (eXpress Data Path), a much
        faster layer for packet processing, but with almost no extra packet
        metadata and limited BPF helper support. Depending on the complexity
        of flows, OVS can offload a subset of its flow processing to XDP when
        feasible. However, the fact that XDP supports fewer helper functions
        implies that either 1) only a very limited number of flows are eligible
        for offload, or 2) more flow processing logic needs to be done in
        native eBPF.

        AF_XDP represents another form of XDP, with a socket interface for the
        control plane and a shared memory API for accessing packets from
        userspace applications. OVS today has another full-fledged datapath
        implementation in userspace, called dpif-netdev, used by the DPDK
        community. By treating AF_XDP as a fast packet-I/O channel, the
        OVS dpif-netdev can satisfy almost all existing features. We are
        working on building the prototype and evaluating its performance.

        RFC patch:
        OVS eBPF datapath.
        https://www.mail-archive.com/iovisor-dev@lists.iovisor.org/msg01105.html

        Speakers: William Tu (VMware), Joe Stringer (Isovalent), Yi-Hung Wei (VMware), Yifeng Sun (VMware)
      • 12:30
        Lunch 1h 30m
      • 14:00
        What's Happened to the World of Networking Hardware Offloads? 35m

        Over the last 10 years the world has seen NICs go from single port,
        single netdev devices, to multi-port, hardware switching, CPU/NFP
        having, FPGA carrying, hundreds of attached netdev providing,
        behemoths. This presentation will begin with an overview of the
        current state of filtering and scheduling, and the evolution of the
        kernel and networking hardware interfaces. (HINT: it’s a bit of a
        jungle we’ve helped grow!) We’ll summarize the different kinds of
        networking products available from different vendors, and show the
        workflows of how a user can use the network hardware
        offloads/accelerations available and where there are still gaps. Of
        particular interest to us is how to have a useful, generic hardware
        offload supporting infrastructure (with seamless software fallback!)
        within the kernel, and we’ll explain the differences between deploying
        an eBPF program that can run in software, and one that can be
        offloaded by a programmable ASIC based NIC. We will discuss our
        analysis of the cost of an offload, and when it may not be a great
        idea to do so, as hardware offload is most useful when it achieves the
        desired speed and requires no special software (kernel changes). Some
        other topics we will touch on: the programmability exposed by smart
        NICs is more than that of a data plane packet processing engine and
        hence any packet processing programming language such as eBPF or P4
        will require certain extensions to take advantage of the device
        capabilities in a holistic way. We’ll provide a look into the future
        and how we think our customers will use the interfaces we want to
        provide both from our hardware, and from the kernel. We will also go
        over the matrix of most important parameters that are shaping our HW
        designs and why.

        Speakers: Jesse Brandeburg (Intel), Anjali Singhai Jain (Intel)
      • 14:35
        XDP 1.5 Years In Production. Evolution and Lessons Learned. 35m

        Today every packet reaching Facebook’s network is processed by an XDP-enabled application. We have been using XDP for more than 1.5 years, and this talk is about the evolution of XDP and BPF as driven by our production needs. I’m going to talk about the history of changes in core BPF components, show why and how they were made, and cover what performance improvements we got (with synthetic and real-world data) and how they were implemented. I’m also going to talk about issues and shortcomings of BPF/XDP that we have found during our operations, as well as some gotchas and corner cases. In the end we are going to discuss what is still missing and which parts could be improved.

        Topics and areas of existing BPF/XDP infrastructure which are going to be covered in this talk:

        • why helpers such as bpf_adjust_head/bpf_adjust_tail have been added (see the sketch after this list)
        • unit testing and microbenchmarking with bpf_prog_test_run: how to add test coverage for your BPF program and track regressions (we are going to cover how Spectre affected the BPF kernel infrastructure and what tweaks have been made to get some performance back)
        • how map-in-map helps us to scale and make sure that we don't waste memory
        • NUMA-aware allocation for BPF maps
        • inline lookups for BPF arrays/map-in-map
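
        A minimal illustrative sketch (not Facebook's code) of the kind of
        use case that motivated the head-adjustment helper: an XDP program
        that pops an outer encapsulation header with bpf_xdp_adjust_head().
        ENCAP_LEN and the section name are assumptions made for the example.

            #include <linux/bpf.h>
            #include "bpf_helpers.h"

            #define ENCAP_LEN 28   /* hypothetical outer header size */

            SEC("xdp")
            int xdp_decap(struct xdp_md *ctx)
            {
                    /* Move the data pointer forward, dropping the outer header. */
                    if (bpf_xdp_adjust_head(ctx, ENCAP_LEN))
                            return XDP_DROP;

                    void *data     = (void *)(long)ctx->data;
                    void *data_end = (void *)(long)ctx->data_end;

                    /* Ensure at least an Ethernet header remains before passing on. */
                    if (data + 14 > data_end)
                            return XDP_DROP;

                    return XDP_PASS;
            }

            char _license[] SEC("license") = "GPL";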

        Lessons which we have learned during operation of XDP:

        • BPF instruction count vs. complexity
        • how to attach more than one XDP program to the interface
        • when LLVM and the verifier disagree: some tricks to force LLVM to generate verifier-friendly BPF
        • we will briefly discuss HW limitations: NIC bandwidth vs. packets-per-second performance

        Missing parts: what and why could be added:

        • the need for hardware checksumming offload
        • bounded loops: what they would allow us to do
        Speaker: Nikita V. Shirokov (Facebook)
      • 15:10
        Keynote: "This Talk Is Not About XDP: From Resource Limits to SKB Lists" 25m
        Speaker: David Miller (Red Hat Inc.)
      • 15:35
        Afternoon Break 25m
      • 16:00
        TC SW Datapath: A Performance Analysis 35m

        Currently the Linux kernel implements two distinct datapaths for Open
        vSwitch: the OVS kernel datapath (ovs kdp) and the TC datapath. The
        latter was added recently, mainly to allow HW offload, while the
        former is usually preferred for SW-based forwarding for functional
        and performance reasons.

        We evaluate both datapaths in a typical forwarding scenario - the PVP
        test - using the perf tool to identify bottlenecks in the TC SW
        datapath. While similar steps usually incur similar costs, the TC SW
        datapath requires an additional, per-packet skb_clone due to a
        constraint of the TC actions infrastructure.

        We propose to extend the existing act infrastructure, leveraging the
        ACT_REDIRECT action and the bpf redirect code, to allow clone-free
        forwarding from the mirred action, and then re-evaluate the
        datapaths' performance: with that change the gap is almost closed.

        Nevertheless, TC SW performance can be further improved by completing
        the RCU-ification of the TC actions and extending the recent
        listification infrastructure to the TC (ingress) hook. We also plan
        to compare the TC SW datapath with a custom eBPF program implementing
        the equivalent flow set, to set a reference value for the target
        performance.

        Speakers: Paolo Abeni (Red Hat), Davide Caratti (Red Hat), Eelco Chaudron (Red Hat), Marcelo Ricardo Leitner (Red Hat)
      • 16:35
        Using eBPF as an Abstraction for Switching 35m

        eBPF (extended Berkeley Packet Filter) has been shown to be a flexible
        kernel construct used for a variety of use cases, such as load balancing,
        intrusion detection systems (IDS), tracing and many others. One such
        emerging use case revolves around the proposal made by William Tu for
        the use of eBPF as a data path for Open vSwitch. However, there are
        broader switching use cases developing around the use of eBPF capable
        hardware. This talk is designed to explore the bottlenecks that exist in
        generalising the application of eBPF further to both container
        switching and physical switching.

        Topics that will be covered include proposals for container isolation through
        the use of features such as programmable RSS, the viability of physical
        switching using eBPF capable hardware as well as integrations with other
        subsystems or additional helper functions which may improve the possible
        functionality.

        Speaker: Nick Viljoen (Netronome)
      • 17:10
        BPF Host Network Resource Management 35m

        Linux currently provides mechanisms for managing and allocating many system resources, such as CPU and memory. Network resource management is more complicated, since networking deals not only with a local resource, as CPU management does, but also with global resources. The goal is not only to provide a mechanism for allocating the local network resource (NIC bandwidth), but also to support management of network resources external to the host, such as link and switch bandwidths.

        For networking, the primary mechanism for allocating and managing bandwidth has been the traffic control (tc) subsystem. While tc allows for shaping of outgoing traffic and policing of incoming traffic, it suffers from some drawbacks. The first drawback is a history of performance issues when using the Hierarchical Token Bucket (HTB) queuing discipline, which is usually required for anything other than simple shaping needs. A second drawback is the lack of flexibility usually provided by general programming constructs.

        We are in the process of designing and implementing a BPF based framework for efficiently supporting shaping of both egress and ingress traffic based on both local and global network allocations.

        Speakers: Lawrence Brakmo (Facebook), Alexei Starovoitov (Facebook)
      • 17:45
        Closing 15m
    • 09:00 12:30
      Performance and Scalability MC Pavillion-Ballroom-C (Sheraton Vancouver Wall Center)

      Pavillion-Ballroom-C

      Sheraton Vancouver Wall Center

      58

      Core counts keep rising, and that means that the Linux kernel continues to encounter interesting performance and scalability issues. Which is not a bad thing, since it has been fifteen years since the ``free lunch'' of exponential CPU-clock frequency increases came to an abrupt end. During that time, the number of hardware threads per socket has risen sharply, approaching 100 for some high-end implementations. In addition, there is much more to scaling than simply larger numbers of CPUs.

      Proposed topics for this microconference include optimizations for mmap_sem range locking; clearly defining what mmap_sem protects; scalability of page allocation, zone->lock, and lru_lock; swap scalability; variable hotpatching (self-modifying code!); multithreading kernel work; improved workqueue interaction with CPU hotplug events; proper (and optimized) cgroup accounting for workqueue threads; and automatically scaling the threshold values for per-CPU counters.

      We are also accepting additional topics. In particular, we are curious to hear about real-world bottlenecks that people are running into, as well as scalability work-in-progress that needs face-to-face discussion.

      • 09:00
        Scheduler task accounting for cgroups 15m

        Cgroup accounting has significant overhead due to the need to constantly loop over all the CPUs to update statistics of CPU usage and blocked averages. On a 4-socket Haswell with a 4.4 kernel, database benchmarks like TPC-C showed an 8% performance regression when run under cgroups. On a recent Cannon Lake platform using the latest PCIe SSDs and a 4.18 kernel, the scheduler regression has grown to 12%. We will highlight the bottlenecks in the scheduler with detailed profiles of the hot path. We'd like to explore possible avenues to improve cgroup accounting.

        Speaker: Tim Chen
      • 09:15
        Seamless update hypervising kernel 15m

        Discuss two possible approaches to live-updating Linux running as a hypervisor without a noticeable effect on running virtual machines (VMs). One method is to use a cooperative multi-OS paradigm to share the same machine between two kernels while the new kernel is booting and the old kernel is still serving the running VM instances: allow the new kernel to live-migrate the drivers from the old kernel by using shadow class drivers, and later live-migrate the running VMs without copying their memory. The second method is to boot the new kernel in a fully virtualized environment that mirrors the underlying hardware, live-migrate the VMs into the newly booted hypervisor, and then make the hypervisor transition from the VM environment onto bare metal.

        Speaker: Pavel Tatashin
      • 09:30
        Load balancing via scalable task stealing 30m

        Summary:
        In this talk I discuss scalability of load balancing algorithms in the task scheduler, and present my work on tracking overloaded CPUs with a bitmap, and using the bitmap to steal tasks when CPUs become idle.

        Abstract:
        The scheduler balances load across a system by pushing waking tasks to idle CPUs, and by pulling tasks from busy CPUs when a CPU becomes idle. Efficient scaling is a challenge on both the push and pull sides on large systems. For pulls, the scheduler searches all CPUs in successively larger domains until an overloaded CPU is found, and pulls a task from the busiest group. This is very expensive, so search time is limited by the average idle time, and some domains are not searched. Balance is not always achieved, and idle CPUs go unused.

        I propose an alternate mechanism that is invoked after the existing search limits itself and finds nothing. I maintain a bitmap of overloaded CPUs, where a CPU sets its bit when its runnable CFS task count exceeds 1. The bitmap is sparse, with a limited number of significant bits per cacheline. This reduces cache contention when many threads concurrently set, clear, and visit elements. There is a bitmap per last-level cache. When a CPU becomes idle, it finds the first overloaded CPU in the bitmap and steals a task from it. For certain configurations and test cases, this optimization improves hackbench performance by 27%, OLTP by 9%, and tbench by 16%, with a minimal cost in search time. I present schedstat data showing the change in vital scheduler metrics before and after the optimization.
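
        An illustrative sketch only (not the actual patch) of the sparse
        bitmap idea: each cacheline carries only a few significant bits, so
        concurrent set/clear/scan operations from many CPUs contend less on
        any single line. BITS_PER_LINE and the structure names are
        assumptions made for this example.

            #include <linux/bitops.h>
            #include <linux/cache.h>

            #define BITS_PER_LINE 8          /* significant bits kept per cacheline */

            struct overload_chunk {
                    unsigned long bits;      /* only the low BITS_PER_LINE bits used */
            } ____cacheline_aligned_in_smp;

            struct overload_map {
                    int nr_chunks;                   /* one chunk per BITS_PER_LINE CPUs */
                    struct overload_chunk chunks[];
            };

            static inline void overload_set(struct overload_map *map, int cpu)
            {
                    set_bit(cpu % BITS_PER_LINE,
                            &map->chunks[cpu / BITS_PER_LINE].bits);
            }

            static inline void overload_clear(struct overload_map *map, int cpu)
            {
                    clear_bit(cpu % BITS_PER_LINE,
                              &map->chunks[cpu / BITS_PER_LINE].bits);
            }

            /* Returns the first overloaded CPU found, or -1 if none. */
            static inline int overload_find_first(struct overload_map *map)
            {
                    int i, bit;

                    for (i = 0; i < map->nr_chunks; i++) {
                            bit = find_first_bit(&map->chunks[i].bits, BITS_PER_LINE);
                            if (bit < BITS_PER_LINE)
                                    return i * BITS_PER_LINE + bit;
                    }
                    return -1;
            }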

        For now the new stealing is confined to the LLC to avoid NUMA effects, but it could be extended to steal across nodes in the future. It could also be extended to the realtime scheduling class. Lastly, the sparse bitmap could be used to track idle cores and idle CPUs and used to optimize balancing on the push side.

        Speaker: Steven Sistare (Oracle)
      • 10:00
        Scheduler and pipe sleep wakeup scalability 30m

        1) Scalability of scheduler idle cpu and core search on systems with large number of cpus

        Current select_idle_sibling first tries to find a fully idle core using select_idle_core, which can potentially search all cores, and if that fails it finds any idle CPU using select_idle_cpu. select_idle_cpu can potentially search all CPUs in the LLC domain. These searches don't scale for large LLC domains and will only get worse with more cores in the future. Spending too much time in the scheduler will hurt the performance of very context-switch-intensive workloads. A more scalable way to do the search, one that is not O(number of CPUs) or O(number of cores) in the worst case, is desirable.

        2) Scalability of idle cpu stealing on systems with large number of cpus and domains

        When a CPU becomes idle, it tries to steal threads from other overloaded CPUs using idle_balance. idle_balance does more work because it searches widely for the busiest CPU to offload, so to limit its CPU consumption, it declines to search if the system is too busy. A more scalable/lightweight way of stealing is desirable so that we can always try to steal with very little cost.

        3) Discuss workloads that use pipes and can benefit from pipe busy waits

        When a pipe is full or empty, a thread goes to sleep immediately. If the sleep/wakeup happens very quickly, the overhead of sleeping and waking can hurt a very context-switch-sensitive workload that uses pipes heavily. A few microseconds of busy waiting before sleeping can avoid the overhead and improve performance; network sockets have a similar capability. So far hackbench with pipes shows huge improvements; we want to discuss other potential use cases. A conceptual sketch of the idea is shown below.
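
        Conceptual sketch only, not a proposed kernel patch: busy-wait for a
        short, bounded period before sleeping on an empty/full pipe, so that
        a quickly arriving wakeup avoids the full sleep/wakeup round trip.
        PIPE_SPIN_NS and the pipe_ready() callback are assumptions made for
        the example.

            #include <linux/ktime.h>
            #include <linux/types.h>
            #include <asm/processor.h>     /* cpu_relax() */

            #define PIPE_SPIN_NS  (10 * NSEC_PER_USEC)  /* a few microseconds */

            static bool pipe_spin_before_sleep(bool (*pipe_ready)(void *), void *pipe)
            {
                    u64 start = ktime_get_ns();

                    while (ktime_get_ns() - start < PIPE_SPIN_NS) {
                            if (pipe_ready(pipe))
                                    return true;   /* avoided the sleep entirely */
                            cpu_relax();
                    }
                    return false;                  /* fall back to the normal wait */
            }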

        Speaker: Subhra Mazumdar
      • 10:30
        Break 30m
      • 11:00
        Promoting huge page usage 30m

        Huge pages are essential to addressing performance bottlenecks,
        since the base page sizes are not changing while the amount of memory
        is ever increasing. Huge pages can address TLB misses but also memory
        overhead in the Linux kernel that arises through page faults and other
        compute-intensive processing of small pages. Huge pages are required
        with contemporary high-speed NVMe SSDs to reach full throughput,
        because the I/O overhead can be reduced and large contiguous memory
        I/O can then be scheduled by the devices. However, using huge pages
        often requires the modification of applications if transparent huge
        pages cannot be used. Transparent huge pages also require
        application-specific setup to work effectively.

        Speaker: Christopher Lameter (Jump Trading LLC)
      • 11:30
        Workqueues and CPU Hotplug 15m

        Flexible workqueues: currently we have two pool setups for workqueues: 1) per-CPU workqueue pools and 2) unbound workqueue pools. The former requires the users of workqueues to have some knowledge of the CPU online state, as shown in:

        https://lore.kernel.org/lkml/20180625224332.10596-2-paulmck@linux.vnet.ibm.com/T/#u

        While the latter (unbound workqueues) only has one pool per NUMA node, which may hurt scalability if we want to run multiple tasks in parallel inside a NUMA node.

        Therefore, there is a clear requirement for a workqueue setup that provides a flexible level of parallelism (i.e. one that could run as many tasks as possible while saving users from worrying about races with CPU hotplug).

        We'd like to have a session to talk about the requirements and possible solutions; a minimal sketch of the two existing setups is shown below.
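
        The sketch below is illustrative driver-style code, not a proposed
        interface: it simply shows the two existing setups side by side.
        The workqueue names are invented for the example.

            #include <linux/workqueue.h>
            #include <linux/errno.h>

            static struct workqueue_struct *percpu_wq;
            static struct workqueue_struct *unbound_wq;

            static int example_init(void)
            {
                    /* Per-CPU pools: work typically runs on the CPU that queued
                     * it, so users must care about the CPU online state. */
                    percpu_wq = alloc_workqueue("example_percpu", 0, 0);

                    /* Unbound pools: one worker pool per NUMA node; hotplug-safe,
                     * but parallelism inside a node is limited by that pool. */
                    unbound_wq = alloc_workqueue("example_unbound", WQ_UNBOUND, 0);

                    if (!percpu_wq || !unbound_wq)
                            return -ENOMEM;
                    return 0;
            }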

        Speaker: Boqun Feng
      • 11:45
        ktask: Parallelizing CPU-intensive kernel work 15m

        Certain CPU-intensive tasks in the kernel can benefit from multithreading, such as zeroing large ranges of memory, initializing massive state (struct page) at boot, VFIO page pinning, XFS quotacheck, and freeing memory on munmap/exit. There is currently no interface that provides this service. ktask is a framework built on workqueues that splits up the work, chooses the number of threads to use, synchronizes these threads, and load balances the work between them. I want to discuss current issues with this work, including allowing ktask threads to play well with the scheduler, cgroup awareness so ktask threads are throttled appropriately, and appropriately enabling ktask according to power management settings.

        Speaker: Daniel Jordan
      • 12:00
        Reducing the number of users of mmap_sem 15m

        The mmap_sem has long been a contention point in the memory
        management subsystem. In this session some mmap_sem-related topics
        will be discussed. Some optimizations have already been merged
        upstream, for example avoiding holding mmap_sem for write for
        excessive periods of time in the munmap path by downgrading the
        write mmap_sem to read (see the sketch below). Other optimizations
        are under discussion on the mailing list, e.g. releasing mmap_sem
        earlier for page cache readahead, and speculative page faults.
        There is still room for optimization by figuring out just what
        mmap_sem protects. It covers access to many fields in the mm_struct
        structure. It is also used for the virtual memory area (VMA)
        red-black tree, the process VMA list, and various fields within the
        VMA structure itself. Finer-grained locks, e.g. a range lock or a
        per-VMA lock, might replace mmap_sem to reduce contention.
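
        Simplified sketch of the merged munmap optimization pattern (the
        real code lives in mm/mmap.c and differs in detail): take mmap_sem
        for write only long enough to detach the VMAs, then downgrade to
        read while the potentially long page-table teardown proceeds.

            #include <linux/mm.h>
            #include <linux/rwsem.h>

            static void example_munmap(struct mm_struct *mm, unsigned long start,
                                       unsigned long len)
            {
                    down_write(&mm->mmap_sem);

                    /* ... detach the VMAs covering [start, start + len) ... */

                    /* Writers are no longer needed; let readers (e.g. page
                     * faults elsewhere in the address space) proceed. */
                    downgrade_write(&mm->mmap_sem);

                    /* ... unmap page tables and free pages under the read lock ... */

                    up_read(&mm->mmap_sem);
            }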

        Speaker: Yang Shi (Alibaba Group)
      • 12:15
        Performance and scalability MC Closing 15m
        Speakers: Daniel Jordan, Pavel Tatashin, Ying Huang
    • 10:30 11:00
      Containers MC: Break Pavillion-Ballroom-AB (Sheraton Vancouver Wall Center)

      Pavillion-Ballroom-AB

      Sheraton Vancouver Wall Center

      35

      The Containers micro-conference at LPC is the opportunity for runtime maintainers, kernel developers and others involved with containers on Linux to talk about what they are up to and agree on the next major changes to kernel and userspace.

    • 14:00 17:30
      Android MC Pavillion-Ballroom-D (Sheraton Vancouver Wall Center)

      Pavillion-Ballroom-D

      Sheraton Vancouver Wall Center

      77

      Android continues to find interesting new applications and problems to solve, both within and outside the mobile arena. Mainlining continues to be an area of focus, as do a number of areas of core Android functionality, including the kernel. Other areas where there is ongoing work include the low memory killer, dynamically-allocated Binder devices, kernel namespaces, EAS, userdata FS checkpointing and DT.

      The working planning document is here:
      https://docs.google.com/spreadsheets/d/1ymzOB4wapccX6t1b11T2-m9ny8buN7EuUqhCxrsmKe4

      • 15:30
        Break 30m
    • 14:00 17:30
      Kernel Summit Track Junior-Ballroom-D (Sheraton Vancouver Wall Center)

      Junior-Ballroom-D

      Sheraton Vancouver Wall Center

      67
    • 14:00 17:30
      LPC Main Track Pavillion-Ballroom-AB (Sheraton Vancouver Wall Center)

      Pavillion-Ballroom-AB

      Sheraton Vancouver Wall Center

      35
      • 14:00
        What could be done in the kernel to make strace happy 45m

        What could be done in the kernel to make strace happy.

        Being a traditional tool with a long history, strace has been making every effort to overcome various deficiencies in the kernel API. Unfortunately, some of these workarounds are fragile, and in some cases no workaround is possible. In this talk maintainers of strace will describe these deficiencies and propose extensions to the kernel API so that tools like strace could work in a more reliable way.

        1

        Problem: there is no kernel API to find out whether the tracee is entering or exiting syscall.

        Current workarounds: strace does its best to sort out and track ptrace events. This works in most cases, but when strace attaches to a tracee that is inside exec, so that its first syscall stop is a syscall-exit-stop instead of a syscall-enter-stop, the workaround is fragile, and in the infamous case of int 0x80 on x86_64 there is no reliable workaround.

        Proposed solution: extend the ptrace API with PTRACE_GET_SYSCALL_INFO request.
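
        To illustrate the fragility (simplified; not strace's actual code):
        a tracer using PTRACE_SYSCALL gets a stop at both syscall entry and
        exit, and can only tell them apart by toggling a flag, which goes
        wrong if the first observed stop is an exit stop. The proposed
        PTRACE_GET_SYSCALL_INFO request would make the distinction explicit.

            #include <stdio.h>
            #include <sys/ptrace.h>
            #include <sys/types.h>
            #include <sys/wait.h>

            static void trace_loop(pid_t pid)
            {
                    int status, in_syscall = 0;   /* the fragile toggle */

                    for (;;) {
                            if (ptrace(PTRACE_SYSCALL, pid, 0, 0) == -1)
                                    break;
                            if (waitpid(pid, &status, 0) == -1 || WIFEXITED(status))
                                    break;

                            /* Assumed to alternate enter/exit stops. */
                            in_syscall = !in_syscall;
                            printf("syscall %s stop\n", in_syscall ? "enter" : "exit");
                    }
            }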

        2

        Problem: there is no kernel API to invoke wait4 syscall with changed signal mask.

        Current workarounds: strace does its best to implement a race-free workaround, but it is way too complex and hard to maintain.

        Proposed solution: add wait6 syscall which is wait4 with additional signal mask arguments, like pselect vs select and ppoll vs poll.

        3

        Problem: time precision provided by struct rusage is too low for strace -c nowadays.

        Current workarounds: none.

        Proposed solution: when adding wait6 syscall, change struct rusage argument to a different structure with fields of type struct timespec instead of struct timeval.

        4

        Problem: PID namespaces have been introduced without a proper kernel API to translate between tracer and tracee views of pids. This causes confusion among strace users, e.g. https://bugzilla.redhat.com/1035433

        Current workarounds: none.

        Proposed solution: add translate_pid syscall, e.g. https://lkml.org/lkml/2018/7/3/589

        5

        Problem: there are no consistent declarative syscall descriptions; this forces every user to reinvent their own wheel and keep catching up with the kernel.

        Current workarounds: a lot of manual work has been done in strace to implement parsers of all syscalls. Some of these parsers are quite complex and hard to test. Other projects, e.g. syzkaller, implement their own representation of syscall ABI.

        Proposed solution: provide declarative descriptions for all syscalls consistently.

        Speakers: Dmitry Levin (BaseALT), Elvira Khabirova (BaseALT), Eugene Syromyatnikov (RedHat)
      • 14:45
        Formal Methods for Kernel Hackers 45m

        Formal methods have a reputation of being difficult, accessible mostly to academics and of little use to the typical kernel hacker. This talk aims to show how, without "formal" training, one can use such tools for the benefit of the Linux kernel. It will introduce a few formal models that helped find actual bugs in the Linux kernel and start a discussion around future uses from modelling existing kernel implementation (e.g. cpu hotplug, page cache states, mm_user/mm_count) to formally specifying new design choices. The introductory examples are written in PlusCal (an algorithm language based on TLA+) but no prior knowledge is required.

        Speaker: Catalin Marinas
      • 15:30
        Break 30m
      • 16:00
        Managing Memory Bandwidth Antagonism at Scale 45m

        Providing a consistent and predictable performance experience for applications is an important goal for cloud providers. Creating isolated job domains in a multi-tenant shared environment can be extremely challenging. At Google, performance isolation challenges due to memory bandwidth have been on the rise with newer workloads. This talk covers our attempt to understand and mitigate isolation issues caused by memory bandwidth saturation.

        The recent Intel RDT support in Linux helps us both monitor and manage memory bandwidth use on newer platforms. However, it still leaves a large chunk of our fleet at risk of memory bandwidth issues. The talk covers three aspects of our isolation attempts:

        1. At Google, with Borg, we run all applications in containers. Our first attempt was to estimate memory bandwidth utilization for each container on all supported platforms by using existing performance counters. The talk will cover details of our approximation methodology and issues we identified in monitoring, as well as some usage trends across different workloads.
        2. The second part of our effort was focused on building actuators and policies for memory bandwidth control. We will cover multiple iterations of our enforcement efforts at node and cluster level, with production use cases and lessons learnt.
        3. For newer platforms, we attempted to use Intel RDT support via the resctrl interface. We ran into issues on both the monitoring and isolation side. We’ll discuss the fixes and workarounds we used and the changes we proposed for resource-control support in Linux.

        We believe the problems and trends we have observed are universally applicable. We hope to inform and initiate discussion around common solutions across the community.

        Speakers: Mr. Rohit Jnagal (Google Inc), Mr. David Lo (Google Inc), Mr. Dragos Sbirlea (Google Inc)
      • 16:45
        oomd: a userspace OOM killer 45m

        Running out of memory on a host is a particularly nasty scenario. In the Linux kernel, if memory is being overcommitted, running out results in the kernel out-of-memory (OOM) killer kicking in. In this talk, Daniel Xu will cover why the Linux kernel OOM killer is surprisingly ineffective and how oomd, a newly open-sourced userspace OOM killer, does a more effective and reliable job. Not only does the switch from kernel space to userspace result in a more flexible solution, but it also directly translates to better resource utilization. His talk will also do a deep dive into the Linux kernel changes and improvements necessary for oomd to operate.

        Speaker: Mr. Daniel Xu (Facebook)
    • 14:00 17:30
      Toolchain Microconference Junior-Ballroom-AB (Sheraton Vancouver Wall Center)

      Junior-Ballroom-AB

      Sheraton Vancouver Wall Center

      100

      The GNU Toolchain and Clang/LLVM play a critical role at the nexus of the Linux Kernel, the Open Source Software ecosystem, and computer hardware. The rapid innovation and progress in each of these components requires greater cooperation and coordination. This Toolchain Microconference will explore recent developments in the toolchain projects, the roadmaps, and how to address the challenges and opportunities ahead as the pace of change continues to accelerate.

      • 14:00
        GCC and the GNU Toolchain: The Essential Collection 20m

        David Edelsohn

      • 14:20
        Support for Control-flow Enforcement Technology 20m

        H.J. Lu

      • 14:40
        GLIBC API to access x86 specific platform features CPU run-time library for C 20m

        H.J.Lu

      • 15:00
        improve glibc and kernel iteration 30m

        Improve glibc and kernel iteration so we can get new features
        upstream faster and more concisely.

      • 15:30
        Break 30m
      • 16:00
        RISC-V 32-bit time_t kernel ABI 20m

        Palmer/Atish

      • 16:20
        Toolchain plans for Armv8.5 20m

        Ramana Radhakrishnan

      • 16:40
        Building Linux kernel with other compilers 20m
    • 09:00 12:40
      BPF MC Pavillion-Ballroom-C (Sheraton Vancouver Wall Center)

      Pavillion-Ballroom-C

      Sheraton Vancouver Wall Center

      58

      BPF is one of the fastest emerging technologies of the Linux kernel and plays a major role in networking (XDP, tc/BPF, etc), tracing (kprobes, uprobes, tracepoints) and security (seccomp, landlock) thanks to its versatility and efficiency.

      BPF has seen a lot of progress since last year's Plumbers conference, and many of the improvements discussed at last year's BPF tracing microconference have since been tackled, such as the introduction of the BPF Type Format (BTF), to name one.

      This year's BPF Microconference event focuses on the core BPF infrastructure as well as its subsystems, therefore topics proposed for this year's event include improving verifier scalability, next steps on BPF type format, dynamic tracing without on the fly compilation, string and loop support, reuse of host JITs for offloads, LRU heuristics and timers, syscall interception, microkernels, and many more.

      Official BPF MC website: http://vger.kernel.org/lpc-bpf.html

      • 09:00
        Scaling Linux Traffic Shaping with BPF 20m

        Google servers classify, measure, and shape their outgoing traffic. The original implementation is based on Linux kernel traffic control (TC). As server platforms scale so does their network bandwidth and number of classified flows, exposing scalability limits in the TC system - specifically contention on the root qdisc lock.

        Mechanisms like selective qdisc bypass, sharded qdisc hierarchies, and low-overhead prequeue ameliorate the contention up to a point. But they cannot fully resolve it. Recent changes to the Linux kernel make it possible to move classification, measurement, and packet mangling outside this critical section, potentially scaling to much higher rates while simultaneously shaping more flows and applying more flexible policies.

        By moving classification and measurement to BPF at the new TC egress hook, servers avoid taking a lock millions of times per second. Running BPF programs at socket connect time with TCP_BPF converts overhead from per-packet to per-flow. The programmability of BPF also allows us to implement entirely new functions, such as runtime-configurable congestion control, first-packet classification and socket-based QoS policies. It also enables faster deployment cycles, as this business logic can be updated dynamically from a user agent. The discussion will focus on our experience converting an existing traffic shaping system to a solution based on BPF, and the issues we’ve encountered during testing and debugging.
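
        A minimal illustrative sockops program (not Google's system) showing
        the per-flow idea: it runs once when an outgoing TCP connection is
        established, rather than on every packet, and counts new flows per
        remote port in a hash map. The map and section names are assumptions
        made for the example.

            #include <linux/bpf.h>
            #include "bpf_helpers.h"
            #include "bpf_endian.h"

            struct bpf_map_def SEC("maps") flows_per_port = {
                    .type        = BPF_MAP_TYPE_HASH,
                    .key_size    = sizeof(__u32),   /* remote port */
                    .value_size  = sizeof(__u64),   /* flow count  */
                    .max_entries = 1024,
            };

            SEC("sockops")
            int count_new_flows(struct bpf_sock_ops *skops)
            {
                    __u32 port;
                    __u64 one = 1, *cnt;

                    /* Only act when an outgoing connection becomes established. */
                    if (skops->op != BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB)
                            return 1;

                    port = bpf_ntohl(skops->remote_port);
                    cnt = bpf_map_lookup_elem(&flows_per_port, &port);
                    if (cnt)
                            __sync_fetch_and_add(cnt, 1);
                    else
                            bpf_map_update_elem(&flows_per_port, &port, &one, BPF_ANY);

                    return 1;
            }

            char _license[] SEC("license") = "GPL";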

        Speakers: Willem de Bruijn (Google), Eddie Hao (Google), Vlad Dumitrescu (Google)
      • 09:20
        Compile-Once Run-Everywhere BPF Programs? 20m

        Compile-once and run-everywhere can make deployment simpler and may consume fewer resources on the target host, e.g. no llvm compiler or kernel devel package. Currently bpf programs for networking can be compiled once and run across different kernel versions. But bpf programs for tracing cannot, since they access kernel internal headers, and these headers are subject to change between kernel versions.

        But compile-once run-everywhere for tracing is not easy. BPF programs could access anything in the kernel headers, including data structures, macros and inline functions. To achieve this goal, we need (1) to preserve header-level accesses for the bpf program, and (2) to abstract the header info of vmlinux. Right before program load on the target host, some kind of resolution is done for the bpf program against the running kernel, so that the resulting program looks just like one compiled against the host kernel headers.

        In this talk, we will explore how BTF could be used by both bpf program and vmlinux to advance the possibility of bpf program compile-once and run-everywhere.

        Speakers: Yonghong Song (Facebook), Alexei Starovoitov (Facebook)
      • 09:40
        ELF relocation for static data in BPF 20m

        BPF program writers today who build and distribute their programs as ELF objects typically write their programs using one of a small set of (mostly) similar headers that establish norms around ELF section definitions. One such norm is the presence of a "maps" section which allows maps to be referenced within BPF instructions using virtual file descriptors. When a BPF loader (e.g. iproute2) opens the ELF, it loads each map referred to in this section, creates a real file descriptor for that map, then updates all BPF instructions which refer to the same map to specify the real file descriptor. This allows symbolic referencing of maps without requiring writers to implement their own loaders or recompile their programs every time they create a map.

        This discussion will take a look at how to provide similar symbolic referencing for static data. Existing implementations already templatize information such as MAC or IP addresses using C macros, then invoke a compiler to replace such static data at load time, at a cost of one compilation per load. By extending the support for static variables into ELF sections, programs could be written and compiled once then reloaded many times with different static data.
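
        A simplified sketch of the existing "maps" section convention
        (details vary between iproute2- and libbpf-style headers, and the
        program below is illustrative only):

            #include <linux/bpf.h>
            #include "bpf_helpers.h"      /* SEC() and struct bpf_map_def */

            struct bpf_map_def SEC("maps") counters = {
                    .type        = BPF_MAP_TYPE_ARRAY,
                    .key_size    = sizeof(__u32),
                    .value_size  = sizeof(__u64),
                    .max_entries = 1,
            };

            SEC("socket")
            int count_bytes(struct __sk_buff *skb)
            {
                    __u32 key = 0;
                    __u64 *val;

                    /* "&counters" is a virtual fd in the ELF object; the loader
                     * rewrites the instruction with the real map fd at load time. */
                    val = bpf_map_lookup_elem(&counters, &key);
                    if (val)
                            __sync_fetch_and_add(val, skb->len);

                    return skb->len;
            }

            char _license[] SEC("license") = "GPL";

        The proposal discussed here is to extend the same load-time patching
        idea from map references to static data.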

        Speakers: Joe Stringer (Cilium), Daniel Borkmann (Cilium)
      • 10:00
        BPF control flow, supporting loops and other patterns 20m

        Currently, BPF cannot support basic loops such as for, while, do/while, etc. Users work around this by forcing the compiler to "unroll" these control flow constructs in the LLVM backend. However, this only works up to a point: unrolling increases the instruction count and the complexity seen by the verifier, and, further, LLVM cannot easily unroll all loops. The result is that developers end up writing unnatural code, iterating until they find a version that LLVM will compile into a form the verifier backend will support. A small example of the unrolling workaround is shown below.
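
        An illustration of today's workaround (not the proposed verifier
        extension): a loop with a small compile-time bound that clang fully
        unrolls, so the verifier never sees a back-edge. MAX_HDRS is an
        assumption made for the example; larger or data-dependent bounds are
        where this approach breaks down.

            #define MAX_HDRS 4

            static inline __attribute__((always_inline))
            int count_small_values(const int *vals)
            {
                    int i, n = 0;

                    #pragma clang loop unroll(full)
                    for (i = 0; i < MAX_HDRS; i++) {
                            if (vals[i] < 16)
                                    n++;
                    }
                    return n;
            }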

        We developed a verifier extension to detect bounded loops here,

        https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/log/?h=wip/bpf-loop-detection

        This requires building a DOM tree (computationally expensive) and then matching loop patterns to find loop invariants to verify loops terminate. In this discussion we would like to cover the pros and cons of this approach. As well as discuss another proposal to use explicit control flow instructions to simplify this task.

        The goal of this discussion would be to come to a consensus on how to proceed to make progress on supporting bounded loops.

        Speaker: John Fastabend (Cilium)
      • 10:20
        Efficient JIT to 32-bit architectures through data flow analysis 20m

        eBPF has 64-bit general purpose registers; therefore 32-bit architectures normally need to use a register pair to model them, and need to generate extra instructions to manipulate the high 32 bits of the pair. Some of this overhead could be eliminated if the JIT compiler knew that only the low 32 bits of a register are of interest. This can be discovered through data flow (DF) analysis techniques: either the classic iterative DF analysis, or a "path-sensitive" version based on the verifier's code path walker.

        In this talk, implementations for both versions of the DF analyser will be presented. We will first see what a classic, def-use-chain-based eBPF DF analyser looks like, and the possibility of integrating it with the previously proposed eBPF control flow graph framework to make a stand-alone eBPF global DF analyser which could potentially serve as a library. Then, another "path-sensitive" DF analyser based on the existing verifier code path walker will be presented. We will discuss how function calls, path pruning and path switching affect the implementation. Finally, we will summarize the pros and cons of each, and see how each of them could be adapted to 64-bit and 32-bit architecture back-ends.

        Also, eBPF has 32-bit sub-registers and associated ALU32 instructions; enabling them (-mattr=+alu32) in LLVM code-gen lets the generated eBPF sequences carry more 32-bit information, which could potentially ease the flow analysis. This will be briefly discussed in the talk as well.

        Speaker: Jiong Wang (Netronome)
      • 10:40
        Morning Break 20m
      • 11:00
        eBPF Debugging Infrastructure - Current Techniques and Additional Proposals 20m

        eBPF (extended Berkeley Packet Filter), in particular with its driver-level hook XDP (eXpress Data Path), has increased in importance over the past few years. As a result, the ability to rapidly debug and diagnose problems is becoming more relevant. This talk will cover common issues faced and techniques to diagnose them, including the use of bpftool for map and program introspection, the use of disassembly to inspect generated assembly code and other methods such as using debug prints and how to apply these techniques when eBPF programs are offloaded to the hardware.

        The talk will also explore where the current gaps in debugging infrastructure are and suggest some of the next steps to improve this, for example, integrations with tools such as strace, valgrind or even the LLDB debugger.
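
        As a small illustration of the "debug prints" technique mentioned
        above (illustrative only): bpf_trace_printk() writes to
        /sys/kernel/debug/tracing/trace_pipe, which makes it handy for quick
        diagnosis, though it is slow and limited to a few arguments.

            #include <linux/bpf.h>
            #include "bpf_helpers.h"

            SEC("xdp")
            int xdp_debug(struct xdp_md *ctx)
            {
                    char fmt[] = "xdp: packet of %d bytes\n";
                    int len = ctx->data_end - ctx->data;

                    bpf_trace_printk(fmt, sizeof(fmt), len);
                    return XDP_PASS;
            }

            char _license[] SEC("license") = "GPL";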

        Speaker: Quentin Monnet (Netronome)
      • 11:20
        eBPF-based tracing tools under 32 bit architectures 20m

        Complex software usually depends on many different components, which sometimes perform background tasks with side effects not directly visible to their users. Without proper tools it can be hard to identify which component is responsible for performance hits or undesired behaviors.

        We were challenged to implement D-Bus observability tools in embedded, ARM32 or ARM64 kernel based environments, both with 32-bit userspace. While we found bcc-tools, an open source compiler collection, useful, it turned out to lack support for 32-bit environments. We extended bcc-tools with support for 32-bit architectures. Using bcc-tools we created Linux eBPF programs – small programs written in a subset of the C language, loaded from user space and executed in kernel context. We attached them to uprobes and kprobes - special kinds of user-space and kernel-space breakpoints. While this worked on an ARM32-kernel-based system, we faced another problem - the ARM64 kernel lacked support for uprobes set in 32-bit binaries. The 64-bit ARM Linux kernel was therefore extended with the ability to probe 32-bit binaries.

        We propose to discuss the challenges we faced trying to implement bcc-tools based tracing tools on ARM devices. We present a working solution to overcome the lack of support for 32-bit architectures in bcc-tools, leaving space for discussion about other ways to achieve the same result. We also introduce 32-bit instruction probing in the ARM64 kernel - a solution that we found very useful in our case. As a proof of concept we present tools that monitor D-Bus usage in ARM32 or ARM64 kernel based systems with 32-bit userspace. We list what needs to be done for complete eBPF-based tools to be fully usable on ARM.

        Speakers: Maciej Slodczyk (Samsung), Adrian Szyndela (Samsung)
      • 11:40
        Using eBPF as a heterogeneous processing ABI 20m

        eBPF (extended Berkeley Packet Filter) is an in-kernel generic virtual machine, which can be used to execute simple programs injected by the user at various hooks in the kernel, on the occurrences of events such as incoming packets. eBPF was designed to simplify the work of in-kernel just-in-time compilers, i.e. translation of eBPF intermediate representation to CPU machine code. Upstream Linux kernel currently contains JITs for all major 64-bit instruction set architectures (ISAs) (x86, AArch64, MIPS, PowerPC, SPARC, s390) as well as some 32-bit translators (ARM, x86, also NFP - Netronome Flow Processor).

        The eBPF generic virtual machine, with its clearly defined semantics, makes a very good vehicle for enabling the programming of custom hardware. From storage devices to networking processors, most host I/O controllers today are built on, or accompanied by, general purpose processing cores, e.g. ARM. As vendors try to expose more and more capabilities of their hardware, using a general purpose machine definition like eBPF to inject code into the hardware directly allows us to avoid creating vendor-specific APIs.

        In this talk I will describe the eBPF offload mechanisms which exist today in the Linux kernel and how they compare to other offloading stacks, e.g. for compute or graphics. I will present proof-of-concept work on reusing existing eBPF JITs for a non-host architecture (e.g. the ARM JIT on x86) to program an emulated device, followed by a short description of the eBPF offload for NFP hardware as an example of a real-life offload.

        Speaker: Jakub Kicinski (Netronome)
      • 12:00
        Traffic policing in eBPF: applying token bucket algorithm 10m

        An eBPF-based traffic policer as a replacement* for the Hierarchical Token Bucket queuing discipline.

        The key idea is the two rate three color marker (RFC 2698) algorithm, whose inputs are the committed and peak rates with their corresponding burst sizes, and whose output is a color or category assigned to a packet. There are conforming, exceeding and violating categories. An action is applied to the violating category - either drop or DSCP remark. Another action may optionally be applied to the exceeding category.

        Close-up of the eBPF implementation**. Write intensity is a cornerstone: an update of the available tokens is required on each packet, as well as tracking of time. The naive implementation is exposed to data races on multi-core systems; there is a problem with updating both the timestamp and the number of available tokens atomically. Slicing the timeline into chunks the size of the burst duration is a solution for the races: each packet is mapped into its chunk, so there is no need to update a global timestamp. Two approaches to storing timeline chunks: a BPF LRU hash map, or a block of timeline chunks in a BPF array, circulating over the block. Pros and cons of the latter approach: lock-free, with the BPF array as the only data structure used, vs. an increased amount of locked memory. A simplified sketch of the chunked-timeline idea is shown after the notes below.

        Combining several policers: a linear chain of policers instead of a hierarchy, passing a packet along the chain. Dealing with bandwidth underutilization when the first K policers in a chain accept a packet and policer K+1 rejects it. The commutative property of chained policers. Interaction with UDP and TCP: TCP reacts to drops by shrinking its congestion window, which affects the actual rate.

        * Note that it's a replacement, not an alternative: the eBPF-based implementation does not involve putting packets into queues.
          ** Since the action is per packet, the eBPF program should be attached to a tc chain; it doesn't work with cgroups.
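
        A simplified sketch of the chunked-timeline idea (single rate,
        single policer; not the actual implementation, and all names and
        sizes are assumptions): the timeline is sliced into fixed chunks,
        each chunk tracks the bytes spent in it, and a packet conforms if
        its chunk's budget is not yet exhausted.

            #include <linux/bpf.h>
            #include "bpf_helpers.h"

            #define NR_CHUNKS      64
            #define CHUNK_NS       (10ULL * 1000 * 1000)  /* 10 ms per chunk */
            #define CHUNK_BUDGET   125000ULL              /* bytes allowed per chunk */

            struct bpf_map_def SEC("maps") chunk_spent = {
                    .type        = BPF_MAP_TYPE_ARRAY,
                    .key_size    = sizeof(__u32),
                    .value_size  = sizeof(__u64),
                    .max_entries = NR_CHUNKS,
            };

            /* Returns 1 if the packet conforms, 0 if it violates the rate.
             * A real implementation must also reset a chunk's counter when
             * the timeline wraps around onto it. */
            static inline __attribute__((always_inline))
            int police(__u32 pkt_len)
            {
                    __u64 now = bpf_ktime_get_ns();
                    __u32 idx = (now / CHUNK_NS) % NR_CHUNKS;  /* circulate over chunks */
                    __u64 *spent;

                    spent = bpf_map_lookup_elem(&chunk_spent, &idx);
                    if (!spent)
                            return 1;

                    if (*spent + pkt_len > CHUNK_BUDGET)
                            return 0;                          /* violating: drop/remark */

                    __sync_fetch_and_add(spent, pkt_len);      /* no global timestamp */
                    return 1;
            }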
        Speaker: Julia Kartseva (Facebook)
      • 12:10
        In-kernel protocol aware filtering 10m

        Deep packet inspection seems to be a largely unexplored area of BPF use cases. The 4096 instruction limit and the lack of loops make such implementations non-straightforward for many protocols. Using XDP and socket filters, at Red Sift, we implemented DNS and TLS handshake detection to provide better monitoring for our clusters. We learned that while the protocol implementation is not necessarily straightforward, the BPF VM provides a reasonably safe environment for DPI-style parsing. When coupled with our Rust userspace implementation, it can provide information and functionality that previously would have required userspace intercepting proxies or middleboxes, at a comparable performance to iptables-style packet filters. Further work is needed to explore how we can turn this into a more comprehensive, active component, mainly due to the BPF VM restrictions around 4096 instruction programs.

        Speaker: Peter Parkanyi (Red Sift)
      • 12:20
        Enhancing User Defined Tracepoints 10m

        BPF trace tools such as bcc/trace and bpftrace can attach to Systemtap USDT (user application statically defined tracepoints) probes. These probes can be created by a macro imported from "sys/sdt.h" or by a provider file. Either way, Systemtap will register those probes as entries in the note section of the ELF file with the name of the probe, its address and the arguments as assembly locations. This approach is fairly simple, easy to parse and non-intrusive. Unfortunately, it is also obsolete and lacks features such as typed arguments and built-in dynamic instrumentation. Since BPF tools are growing in popularity, it makes sense to create a new enhanced format to fix these shortcomings.

        We can discuss and make decisions about the future of USDT probes used by BPF trace tools. Some possible alternatives are: extend Systemtap USDT to introduce these new features or extend kernel tracepoints so that user applications can also register them.

        Speaker: Matheus Marchini (Sthima)
      • 12:30
        Augmenting syscalls in 'perf trace' using eBPF 10m

        The 'perf trace' tool uses the syscall tracepoints to provide a !ptrace based 'strace' like tool, augmenting the syscall arguments provided by the tracepoints with integer->strings tables automatically generated from the kernel headers, showing the paths associated with fds, pid COMMs, etc.

        That is enough for integer arguments; pointer arguments need either kprobes put in special locations, which is fragile and has so far been implemented only for getname_flags (open, etc. filenames), or eBPF hooked into the syscall enter/exit tracepoints to collect pointer contents right after the existing tracepoint payload.

        This has been done to some extent and is present in the kernel sources in tools/perf/examples/bpf/augmented_syscalls.c, using the pre-existing support in perf for using BPF C programs as event names, automagically using clang/llvm to build and load them via sys_bpf(). 'perf trace' hooks this into the existing beautifiers which, seeing that extra data, use it to get the filename, struct sockaddr_in, etc.

        This was done for a bunch of syscalls; what is left is to get this all automated using BTF, to allow passing filters attached to the syscalls, to select which syscalls should be traced, to use a pre-compiled augmented_syscalls.c while just selecting which bits of the object should be used, etc. These open issues in streamlining the process, to avoid requiring the clang toolchain and so on, will be the matter of this discussion.

        Speaker: Arnaldo Carvalho de Melo (Red Hat)
    • 09:00 12:30
      Kernel Summit Track Junior-Ballroom-D (Sheraton Vancouver Wall Center)

      Junior-Ballroom-D

      Sheraton Vancouver Wall Center

      67
    • 09:00 12:30
      LPC Main Track Pavillion-Ballroom-AB (Sheraton Vancouver Wall Center)

      Pavillion-Ballroom-AB

      Sheraton Vancouver Wall Center

      35
      • 09:00
        The hard work behind large physical allocations in the kernel 45m

        The physical memory management in the Linux kernel is mostly based on single page allocations, but there are many situations where larger physically contiguous memory needs to be allocated. Some are for the benefit of userspace (e.g. huge pages), others for better performance in the kernel (SLAB/SLUB, networking, and others).

        Making sure that contiguous physical memory is available for allocation is far from trivial, as pages are reclaimed for reuse roughly in least-recently-used (LRU) order, which is typically different from their physical placement. The freed memory is thus fragmented. The kernel has two complementary mechanisms to defragment the free memory. One is memory compaction, which migrates used pages to make the free pages contiguous. The other is page grouping by mobility, which tries to make sure that pages that cannot be migrated are grouped together, so the rest of the pages can be effectively compacted. Both mechanisms employ various heuristics to balance the success of large allocations against their overhead in terms of latencies due to processor and lock usage.

        The talk will discuss the two mechanisms, focusing on the known problems and their possible solutions, that have been proposed by several memory management developers.

        Speaker: Vlastimil Babka (SUSE)
      • 09:45
        WireGuard: Next-Generation Secure Kernel Network Tunnel 45m

        WireGuard [1] [2] is a new network tunneling mechanism written for
        Linux, which, after three years of development, is nearly ready for
        upstream. It uses a formally proven cryptographic protocol, custom
        tailored for the Linux kernel, and has already seen very widespread
        deployment, in everything from smart phones to massive data center
        clusters. WireGuard uses a novel timer mechanism to hide state from
        userspace, and in general presents userspace with a "stateless" and
        "declarative" system of establishing secure tunnels. The codebase is
        also remarkably small and has been written with a number of defense in
        depth techniques. Integration into the larger Linux ecosystem is
        advancing at a healthy rate, with recent patches for systemd and
        NetworkManager merged. There is also ongoing work on combining
        WireGuard with automatic configuration and mesh routing daemons on
        Linux. This talk will focus on a wide variety of WireGuard’s innards
        and tentacles onto other projects. The presentation will walk through
        WireGuard's integration into the netdev subsystem, its unique use of
        network namespaces, why kernel space is necessary, the
        various hurdles that have gone into designing a cryptographic protocol
        specifically with kernel constraints in mind. It will also examine a
        practical approach to formal verification, suitable for kernel
        engineers and not just academics, and connect the ideas of that with
        our extensive continuous integration testing framework across multiple
        kernel architectures and versions. As if that was not already enough,
        we will also take a close look at the interesting performance aspects
        of doing high throughput CPU-bound computations in kernel space while
        still keeping latency to a minimum. On the topic of smartphones, the
        talk will examine power efficiency techniques of both the
        implementation and of the protocol design, our experience in
        integrating this into Android kernels, and the relationship between
        cryptographic secrets and smartphone suspend cycles. Finally we will
        look carefully at the WireGuard userspace API and its usage in various
        daemons and managers. In short, this presentation will examine the
        networking and cryptography design, the kernel engineering, and the
        userspace integration considerations of WireGuard.

        [1] https://www.wireguard.com
        [2] https://www.wireguard.com/papers/wireguard.pdf

        Speaker: Jason Donenfeld
      • 10:30
        Break 30m
      • 11:00
        Recursive read deadlocks and Where to find them 45m

        Lockdep (the deadlock detector in the Linux kernel) is a powerful tool to detect deadlocks, and has been used for a long time by kernel developers. However, when it comes to read/write lock deadlock detection, lockdep only has limited support. What makes this limited support worse is that some major architectures (x86 and arm64) have switched, or are trying to switch, their rwlock implementation to queued rwlocks. As one example, we found deadlock cases that happened in the kernel but that we could not detect with lockdep.

        To improve this situation, a patchset to support read/write deadlock detection in lockdep has been posted to LKML and has reached v6. Although it got several positive reviews, some details about the reasoning behind its correctness, among other things, still need more discussion.

        This topic will give a brief introduction on rwlock related deadlocks (recursive read deadlocks) and how we can tweak lockdep to detect them. It will focus on the detection algorithm and its correctness, but also some implementation details.

        This topic will provide the opportunity to discuss the reasoning and the overall design with some core locking developers, along with the opportunity to discuss usage scenarios with potential users. The expected result is a cleaner plan for upstreaming this, and more developers educated on how to use it to help their work.
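
        As a concrete illustration of the class of bug involved (an
        interleaving sketch in kernel-style code, not a runnable test case):
        with a fair/queued rwlock, a recursive read_lock() can deadlock when
        a writer queues up in between.

            #include <linux/spinlock.h>   /* rwlock_t, read_lock(), write_lock() */

            static DEFINE_RWLOCK(lock_a);

            /* CPU 0 */
            static void cpu0(void)
            {
                    read_lock(&lock_a);    /* step 1: first reader gets the lock */
                    /* ... */
                    read_lock(&lock_a);    /* step 3: recursive read; a queued rwlock
                                            * makes it wait behind the writer from
                                            * CPU 1 -> deadlock */
                    read_unlock(&lock_a);
                    read_unlock(&lock_a);
            }

            /* CPU 1 */
            static void cpu1(void)
            {
                    write_lock(&lock_a);   /* step 2: queues after CPU 0's reader
                                            * and blocks any further readers */
                    write_unlock(&lock_a);
            }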

        Speaker: Boqun Feng
      • 11:45
        Enhancing perf to export processor hazard information 45m

        Most modern microprocessors employ complex instruction execution pipelines such that many instructions can be 'in flight' at any given point in time. The efficiency of this pipelining is typically measured by how many instructions are completed per CPU cycle, a metric variously called Instructions Per Cycle (IPC) or its inverse, Cycles Per Instruction (CPI). Various factors affect this metric, and hazards are primary among them. Different types of hazards exist - data hazards, structural hazards and control hazards. A data hazard is the case where data dependencies exist between instructions in different stages of the pipeline. A structural hazard is when the same processor hardware is needed by more than one instruction in flight at the same time. Control hazards are of the branch-misprediction kind. Information about these hazards is critical for analyzing performance issues and also for tuning software to overcome such issues. Modern processors export such hazard data in Performance Monitoring Unit (PMU) registers. In this talk, we propose an arch-neutral extension to perf to export the hazard data presented in different ways by different architectures. We also present how this extension has been applied to the IBM Power processor, the APIs and example output.

        Speaker: Mr. Madhavan Srinivasan (IBM Linux Technology Center)
    • 09:00 12:30
      Power Management and Energy-awareness MC Junior-Ballroom-AB (Sheraton Vancouver Wall Center)

      Junior-Ballroom-AB

      Sheraton Vancouver Wall Center

      100

      The focus will be on power management frameworks, task scheduling in relation to power/energy optimization, and platform power management mechanisms. The goal is to facilitate cross framework and cross platform discussions that can help improve power and energy-awareness in Linux.

      • 09:00
        Energy-aware scheduling 25m

        An updated proposal for Energy Aware Scheduling has been posted and discussed on LKML during this year [1]. The patch set introduces an independent Energy Model framework holding active power cost of CPUs, and changes the scheduler's wake-up balancing code to use this newly available information when deciding on which CPU a task should run.

        This session aims at discussing the open problems identified during the review as well as possible improvements to other areas of the scheduler to further improve energy efficiency.

        [1] https://lore.kernel.org/lkml/20181016101513.26919-1-quentin.perret@arm.com/

        Speakers: Dietmar Eggemann (ARM), Quentin Perret (ARM)
      • 09:25
        Expressing per-task/per-cgroup performance hints 25m

        The Linux scheduler is able to drive frequency selection, when the schedutil cpufreq governor is in use, based on task utilization aggregated at the CPU level. The CPU utilization is then used to select the frequency which best fits the task's generated workload. The current translation of utilization values into a frequency selection is pretty simple: we just go to max for RT tasks, or to the minimum frequency which can accommodate the utilization of DL+FAIR tasks.

        While this simple mechanism is good enough for DL tasks, for RT and FAIR tasks we can aim at some better frequency driving which can take into consideration hints coming from user-space.

        Utilization clamping is a mechanism which allows the utilization generated by RT and FAIR tasks to be filtered within a range defined from user space, either for a single task or for task groups. The clamped utilization requirements of RUNNABLE tasks are aggregated at the CPU level and used to enforce its minimum and/or maximum frequency.

        This session is meant to give an update on the most recent LKML posting of the utilization clamping patchset and to open a discussion on how to better progress this proposal.

        Speakers: Morten Rasmussen (Arm), Patrick Bellasi (Arm Ltd.)
      • 09:50
        Towards improved selection of CPU idle states 20m

        The venerable menu governor does some things that are quite questionable in my view. First, it includes timer wakeups in the pattern detection data and mixes them up with wakeups from other sources, which in some cases causes it to expect what essentially would be a timer wakeup in a time frame in which no timer wakeups are possible (because it knows the time until the next timer event, and that is later than the expected wakeup time). Second, it uses an extra exit latency limit based on the predicted idle duration and on the number of tasks waiting on I/O, even though those tasks may run on a different CPU when they are woken up. Moreover, the time ranges used by it for the sleep length correction factors are not correlated to the list of available idle states in any way whatsoever, and different correction factors are used depending on whether or not there are tasks waiting on I/O, which again doesn't imply anything in particular.

        A major rework of the menu governor would be required to address these issues and it is likely that the performance of some workloads would suffer from that. That raises the question of whether or not to try to improve the menu governor or to introduce an entirely new one to replace it, or to do both these things simultaneously.
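
        For context, a much simplified sketch of the basic state-selection rule an idle governor applies, leaving out the correction factors and I/O-wait heuristics discussed above (all numbers are made up):

        #include <stdio.h>

        struct idle_state {
            const char *name;
            unsigned int exit_latency_us;
            unsigned int target_residency_us;
        };

        /* Made-up idle states, ordered from shallowest to deepest. */
        static const struct idle_state states[] = {
            { "WFI",  1,   1 },
            { "C1",  30, 100 },
            { "C2", 200, 800 },
        };

        /* Deepest state whose target residency fits the predicted idle duration
         * and whose exit latency fits the latency constraint. */
        static int select_state(unsigned int predicted_us, unsigned int latency_limit_us)
        {
            int i, best = 0;

            for (i = 0; i < 3; i++) {
                if (states[i].target_residency_us > predicted_us)
                    break;
                if (states[i].exit_latency_us > latency_limit_us)
                    break;
                best = i;
            }
            return best;
        }

        int main(void)
        {
            int s = select_state(500 /* predicted idle, us */, 50 /* QoS limit, us */);

            printf("selected idle state: %s\n", states[s].name);
            return 0;
        }

        The issues above are all about how the predicted idle duration and the latency limit that feed this rule are derived.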

        Speaker: Rafael Wysocki (Intel Open Source Technology Center)
      • 10:10
        Generic power domains (genpd) framework improvements 20m

        The Generic PM domains framework (genpd) keeps evolving to deal with new problems. Lately, we have, for example, seen genpd incorporate support for active-state power management, as well as support for multiple PM domains per device. Let's walk through these recent changes and discuss their impact.
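
        As a rough sketch of what multi-domain attach can look like on the driver side (error handling and detach paths are trimmed, and the exact helpers and semantics are assumptions rather than taken from the talk):

        #include <linux/device.h>
        #include <linux/pm_domain.h>
        #include <linux/pm_runtime.h>

        /* Hypothetical driver whose device is described with two PM domains. */
        static int example_attach_domains(struct device *dev)
        {
            struct device *pd0, *pd1;

            /* Attach to the two domains listed for this device; each attach
             * returns a virtual device representing that domain. */
            pd0 = dev_pm_domain_attach_by_id(dev, 0);
            if (IS_ERR(pd0))
                return PTR_ERR(pd0);

            pd1 = dev_pm_domain_attach_by_id(dev, 1);
            if (IS_ERR(pd1))
                return PTR_ERR(pd1);

            /* Device links make runtime PM of the consumer power the domains
             * on and off together with it. */
            device_link_add(dev, pd0, DL_FLAG_STATELESS | DL_FLAG_PM_RUNTIME);
            device_link_add(dev, pd1, DL_FLAG_STATELESS | DL_FLAG_PM_RUNTIME);

            return 0;
        }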

        Speaker: Ulf Hansson (Linaro)
      • 10:30
        Break 25m
      • 10:55
        Firmware interfaces for power management vs direct control of resources 25m

        While new technologies in platform power management continue to evolve, we need to look at ways to keep platform power management independent of the OSPM. Custom vendor solutions for power management and device/system configuration lead to fragmentation.

        ACPI solved the problem for some market segments by abstracting away the details, but we still need an alternative for the traditional embedded/mobile market. ARM SCMI continues to address concerns in a few of these functional areas, but there is still a lot of resistance to moving away from direct control of power resources in the OS. Examples include:

        a. Voltage dependencies for clocks (DVFS) [1] - genpd and performance domain integration
        b. Generic cpufreq governor for devfreq [2]
        c. On-chip interconnect API [3]

        This session aims at reaching some consensus and guidelines going forward to avoid further fragmentation.

        [1] https://www.spinics.net/lists/linux-clk/msg27587.html
        [2] https://patchwork.ozlabs.org/cover/916114/
        [3] https://patchwork.kernel.org/cover/10562761/

        Speaker: Sudeep Holla (ARM)
      • 11:20
        Runtime power sharing among CPUs, GPUs and others 25m

        Due to high performance demands, systems tend to be over-provisioned: it is not possible to run every component at its peak power at the same time. Even if each component has the capability to report power and set power limits, there is no kernel-level framework to balance power between them. IPA (the Intelligent Power Allocator thermal governor) addresses part of this, but on the systems in question thermal limits are usually not the problem; a sudden power overdraw is the bigger issue (particularly on unlocked systems). In addition, without a proper power balance among components, they can starve each other. For example, in Intel Kaby Lake-G there are 4 big power consumers: the CPUs, two GPUs and memory. If the CPUs take most of the power, graphics performance suffers because the GPU cannot handle requests in a timely manner. So power has to be managed at run time, based on the workload demand.

        Speaker: Srinivas Pandruvada (Intel)
      • 11:45
        Runtime PM timer granularity issue 20m

        Runtime PM allows drivers to automatically suspend devices that have not been used for a defined amount of time. This autosuspend feature is really efficient for handling bursts of activity on a device by optimizing the number of runtime PM suspend/resume calls. However, the runtime PM timers used for that are fully based on jiffies granularity which raises problems for some embedded ARM platforms that want to optimize their energy usage as much as possible. For example, the minimum timeout value on arm64 is between 4 and 8 ms.

        The session will discuss the impact of switching runtime PM over to using hrtimers and a more fine-grained time scale. It will also highlight the advantages and drawbacks of the changes relative to the current situation.
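
        For reference, a minimal sketch of the autosuspend pattern in question, for a hypothetical driver (the 5 ms delay is the kind of short timeout whose effective resolution depends on the underlying timers):

        #include <linux/device.h>
        #include <linux/pm_runtime.h>

        /* Hypothetical driver using the runtime PM autosuspend helpers. */
        static void example_setup_autosuspend(struct device *dev)
        {
            /* Suspend the device 5 ms after it was last marked busy. */
            pm_runtime_set_autosuspend_delay(dev, 5);
            pm_runtime_use_autosuspend(dev);
            pm_runtime_enable(dev);
        }

        static int example_do_io(struct device *dev)
        {
            int ret;

            ret = pm_runtime_get_sync(dev);         /* resume if suspended */
            if (ret < 0) {
                pm_runtime_put_noidle(dev);
                return ret;
            }

            /* ... perform the transfer ... */

            pm_runtime_mark_last_busy(dev);         /* restart the autosuspend timer */
            pm_runtime_put_autosuspend(dev);
            return 0;
        }

        With jiffies-based timers, a short delay like the 5 ms above gets rounded up to tick granularity, which is the issue being discussed.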

        Speaker: Vincent Guittot (Linaro)
      • 12:05
        On-chip Interconnect API Proposal 25m

        Modern SoCs have multiple CPUs and DSPs that generate a lot of data flowing through the on-chip interconnects. The topologies can be multi-tiered and complex. These buses are designed to handle use cases with high data throughput, but as the workload varies they need to be scaled to avoid wasting power. Furthermore, the priority between masters can vary depending on the running use case, such as video playback or CPU-intensive tasks. The purpose of this new API is to allow drivers to express their QoS needs for interconnect paths between the various SoC components. The requests from drivers are aggregated, and the system configures the interconnect hardware to the optimal performance and power profile.

        The session will discuss the following:
        - How the consumer drivers can determine their bandwidth needs.
        - How to support different QoS configurations based on whether each CPU/DSP device is active or sleeping.
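
        A consumer-side sketch of how such an API could look (the function names, path name and bandwidth units here are assumptions and may not match the proposal exactly):

        #include <linux/device.h>
        #include <linux/interconnect.h>

        /* Hypothetical consumer driver requesting bandwidth on one path. */
        static int example_request_bandwidth(struct device *dev)
        {
            struct icc_path *path;
            int ret;

            /* Look up the interconnect path this device needs, by name. */
            path = of_icc_get(dev, "dma-mem");
            if (IS_ERR(path))
                return PTR_ERR(path);

            /* Request average and peak bandwidth for the current use case;
             * requests from all consumers sharing the path are aggregated
             * by the framework before the hardware is reconfigured. */
            ret = icc_set_bw(path, 1000000, 2000000);
            if (ret)
                dev_err(dev, "interconnect bandwidth request failed\n");

            icc_put(path);
            return ret;
        }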

        Speaker: Vincent Guittot (Linaro)
    • 09:00 17:30
      RDMA MC Junior-Ballroom-C (Sheraton Vancouver Wall Center)

      Junior-Ballroom-C

      Sheraton Vancouver Wall Center

      67

      Remote DMA Microconference

      • 09:00
        Welcome 20m

        Opening RDMA session with agenda, announcements and some statistics from last year.

        Speakers: Mr. Jason Gunthorpe, Leon Romanovsky
      • 09:20
        Container and namespaces for RDMA topics 40m
        • Remaining sticky situations with container namespaces in sysfs and legacy all-namespace operation
        • Remaining CM issues
        • Security isolation problems
        Speakers: Doug Ledford, Parav Pandit
      • 10:00
        Remote page faults over RDMA 30m

        Discussion of the best way to govern third-party memory registration, and whether it is acceptable to implement RDMA-specific functionality (in this case, page fault handling) inside the kernel in order to avoid exposing additional interfaces.

        Speakers: Joel Nider, Mike Rapoport (IBM)
      • 10:30
        Break 30m
      • 11:00
        RDMA and get_user_pages 1h

        RDMA, DAX and persistent memory co-existence.

        Explore the limits of what is possible without using On Demand Paging Memory Registration. Discuss 'shootdown' of userspace MRs.

        Dirtying pages obtained with get_user_pages() can oops ext4; discuss open solutions.
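
        For context, a simplified sketch of the pattern at the heart of the problem, roughly as a kernel driver of that era would write it (no error handling; the exact signatures and locking are assumptions based on kernels of that period):

        #include <linux/mm.h>
        #include <linux/sched.h>
        #include <linux/slab.h>

        /* Simplified memory-registration style pinning of user pages. */
        static void example_pin_and_release(unsigned long uaddr, unsigned long npages)
        {
            struct page **pages;
            long i, pinned;

            pages = kcalloc(npages, sizeof(*pages), GFP_KERNEL);
            if (!pages)
                return;

            /* Pin the user pages so the device can DMA into them. */
            down_read(&current->mm->mmap_sem);
            pinned = get_user_pages(uaddr, npages, FOLL_WRITE, pages, NULL);
            up_read(&current->mm->mmap_sem);

            /* ... the HCA may DMA into these pages for a long time, while
             * filesystem writeback can still run against the same pages ... */

            for (i = 0; i < pinned; i++) {
                set_page_dirty_lock(pages[i]);  /* data arrived via DMA */
                put_page(pages[i]);
            }
            kfree(pages);
        }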

        Speakers: Dan Williams (Intel Open Source Technology Center), Jan Kara, John Hubbard (NVIDIA), Matthew Wilcox
      • 12:00
        Very large Contiguous regions in userspace 30m

        • Poor performance of get_user_pages() on very large virtual ranges
        • No standardized API to allocate regions to user space
        • Carry-over from last year

        Speaker: Christopher Lameter (Jump Trading LLC)
      • 12:30
        Lunch 1h 30m
      • 14:00
        RDMA and PCI peer to peer 1h

        RDMA and PCI peer to peer transactions. IOMMU issues. Integration with HMM. How to expose PCI BAR memory to userspace and other drivers as a DMA target.

        Speaker: Stephen Bates
      • 15:00
        Improving testing of RDMA with syzkaller, RXE and Python 30m

        Address RDMA's distinct lack of public tests.
        Provide a better framework for all drivers to test with, and a framework for basic testing in userspace.

        Worst remaining unfixed syzkaller bugs and how to try to fix them.

        Speakers: Jason Gunthorpe, Noa Osherovich
      • 15:30
        Break 30m
      • 16:00
        IOCTL conversion and new kABI topics 30m

        Attempt to close out the remaining tasks needed to complete the project.

        Restore fork() support to userspace

        Speaker: Jason Gunthorpe (Mellanox Technologies)
      • 16:30
        RDMA BoF and Closing Session 1h

        Let's gather together and try to plan next year.

        Speakers: Mr. Jason Gunthorpe (Mellanox Technologies), Leon Romanovsky
    • 09:00 12:30
      RISC-V MC Pavillion-Ballroom-D (Sheraton Vancouver Wall Center)

      Pavillion-Ballroom-D

      Sheraton Vancouver Wall Center

      77

      The momentum behind the RISC-V ecosystem is really commendable, and its open nature has played a large role in that growth. It has allowed contributions from both the academic and industry communities, leading to an unprecedented number of hardware design proposals in a very short span of time. Soon, a wider variety of RISC-V based hardware boards and extensions will be available, allowing a larger range of applications not limited to embedded micro-controllers. The RISC-V software ecosystem also needs to grow across the stack so that RISC-V can be a true alternative to existing ISAs. Linux kernel support holds the key to this.

      The primary objective of the RISC-V track at Plumbers is to initiate a community-wide discussion about the design problems and ideas for the different Linux kernel features that are implemented or yet to be implemented. We believe this will also result in a significant increase in active developer participation in code review and patch submission, which will lead to a better and more stable kernel for RISC-V.

      • 10:30
        Break 30m
    • 14:00 16:45
      Kernel Summit Track Junior-Ballroom-D (Sheraton Vancouver Wall Center)

      Junior-Ballroom-D

      Sheraton Vancouver Wall Center

      67
    • 14:00 16:45
      LPC Main Track Pavillion-Ballroom-AB (Sheraton Vancouver Wall Center)

      Pavillion-Ballroom-AB

      Sheraton Vancouver Wall Center

      35
      • 14:00
        A practical introduction to XDP 45m

        The eXpress Data Path (XDP) has been gradually integrated into the Linux kernel over several releases. XDP offers fast and programmable packet processing in kernel context. The operating system kernel itself provides a safe execution environment for custom packet processing applications, in the form of eBPF programs executed in device driver context. XDP provides a fully integrated solution working in concert with the kernel's networking stack. Applications are written in higher-level languages such as C and compiled via LLVM into eBPF bytecode, which the kernel statically analyses for safety and JIT-translates into native instructions. This is an alternative approach to kernel bypass mechanisms (like DPDK and netmap).

        This talk gives a practically focused introduction to XDP, describing and giving code examples for the programming environment provided to the XDP developer. The programmer needs to change their mindset a bit when coding for the XDP/eBPF execution environment. XDP programs are often split between eBPF code running on the kernel side and a control plane running in userspace. The control-plane API is not predefined and is left to the programmer, implemented by manipulating shared eBPF maps from userspace.
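
        As a taste of the environment, a minimal sketch of an XDP program (illustrative only, not taken from the talk; the expected section name, byte-order helpers and attach method depend on the loader and toolchain in use):

        /* Minimal XDP program: drop UDP packets, pass everything else.
         * Compile with clang -O2 -target bpf. */
        #include <linux/bpf.h>
        #include <linux/if_ether.h>
        #include <linux/ip.h>
        #include <linux/in.h>

        #ifndef SEC
        #define SEC(name) __attribute__((section(name), used))
        #endif

        SEC("xdp")
        int xdp_drop_udp(struct xdp_md *ctx)
        {
            void *data_end = (void *)(long)ctx->data_end;
            void *data = (void *)(long)ctx->data;
            struct ethhdr *eth = data;
            struct iphdr *iph;

            /* Explicit bounds checks keep the in-kernel verifier happy. */
            if ((void *)(eth + 1) > data_end)
                return XDP_PASS;
            if (eth->h_proto != __constant_htons(ETH_P_IP))
                return XDP_PASS;

            iph = (void *)(eth + 1);
            if ((void *)(iph + 1) > data_end)
                return XDP_PASS;

            return iph->protocol == IPPROTO_UDP ? XDP_DROP : XDP_PASS;
        }

        char _license[] SEC("license") = "GPL";

        A userspace control plane would load such an object (for example with iproute2 or libbpf) and, in more interesting programs, steer its behaviour at run time through shared eBPF maps.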

        Speakers: Jesper Dangaard Brouer (Red Hat), Mr. Andy Gospodarek (Broadcom)
      • 14:45
        Task Migration at Scale Using CRIU 45m

        The Google computing infrastructure uses containers to manage millions of simultaneously running jobs in data centers worldwide. Although the applications are container aware and are designed to be resilient to failures, evictions due to resource contention and scheduled maintenance events can reduce overall efficiency due to the time required to rebuild complex application state. This talk discusses the ongoing use of the open source Checkpoint/Restore in Userspace (CRIU) software to migrate container workloads between machines without loss of application state, allowing improvements in efficiency and utilization. We’ll present our experiences with using CRIU at Google, including ongoing challenges supporting production workloads, current state of the project, changes required to integrate with our existing container infrastructure, new requirements from running CRIU at scale, and lessons learned from managing and supporting migratable containers. We hope to start a discussion around the future direction of CRIU as well as task migration in Linux as a whole.

        Speakers: Mr. Victor Marmol (Google), Mr. Andy Tucker (Google)
      • 15:30
        Break 30m
      • 16:00
        Migrating to Gitlab 45m

        Over the past few years the graphics subsystem has been spearheading experiments in running things differently: Pre-merge CI wrapped around mailing lists using patchwork, committer model as a form of group maintainership on steroids, and other things. As a result the graphics people have run into some interesting new corner cases of the kernel's "patches carved on stone tablets" process.

        On the other hand, the freedesktop.org project, which provides all the server infrastructure for the graphics subsystem, is undergoing a big reorganization of how it provides its services. The biggest change is migrating all source hosting over to a GitLab instance.

        This talk will go into the why of these changes and detail what is definitely going to change, and what is being looked into more as experiments with open outcomes.

        Speaker: Daniel Vetter (Intel)
    • 14:00 17:00
      Live kernel patching MC Junior-Ballroom-AB (Sheraton Vancouver Wall Center)

      Junior-Ballroom-AB

      Sheraton Vancouver Wall Center

      100

      The main purpose of the Linux Plumbers 2018 Live kernel patching miniconference is to involve all stakeholders in open discussion about remaining issues that need to be solved in order to make Live patching of the Linux Kernel (more or less) feature complete.

      The miniconference will focus on features that have been proposed (some even with a preliminary implementation) but are not yet finished, with the ultimate goal of sorting out the remaining issues.

      • 14:00
        User space live patching (libpulp) 15m
        Speaker: Joao Moreira
      • 14:15
        Livepatch patch creation tooling 15m
        Speaker: Nicolai Stange
      • 14:30
        Livepatch callback state management 15m
        Speaker: Nicolai Stange
      • 14:45
        Livepatch is too flexible 15m
        Speaker: Josh Poimboeuf (Red Hat)
      • 15:00
        Livepatch stable trees 15m
        Speaker: Jason Baron
      • 15:15
        Elivepatch - flexible distributed live patch generation 15m
        Speaker: Alice Ferrazzi
      • 15:30
        Break 30m
      • 16:00
        GCC optimizations and their impacts on live patching 15m
        Speaker: Miroslav Benes
      • 16:15
        Livepatch s390x consistency model 10m
        Speaker: Joe Lawrence
      • 16:25
        Livepatch arm64 support 10m
        Speaker: Torsten Duwe
      • 16:35
        Objtool powerpc support 10m
        Speaker: Kamalesh Babulal
    • 14:45 16:45
      BPF MC Pavillion-Ballroom-C (Sheraton Vancouver Wall Center)

      Pavillion-Ballroom-C

      Sheraton Vancouver Wall Center

      58

      BPF is one of the fastest-emerging technologies in the Linux kernel and plays a major role in networking (XDP, tc/BPF, etc.), tracing (kprobes, uprobes, tracepoints) and security (seccomp, landlock) thanks to its versatility and efficiency.

      BPF has seen a lot of progress since last year's Plumbers conference, and many of the improvements discussed at last year's BPF tracing microconference have since been tackled, the introduction of the BPF Type Format (BTF) being one example.

      This year's BPF microconference focuses on the core BPF infrastructure as well as its subsystems. Topics proposed for this year's event include improving verifier scalability, next steps for the BPF Type Format, dynamic tracing without on-the-fly compilation, string and loop support, reuse of host JITs for offloads, LRU heuristics and timers, syscall interception, microkernels, and more.

      Official BPF MC website: http://vger.kernel.org/lpc-bpf.html
